Engineering Manager, Site Reliability Engineering

Nayya • New York, NY

About The Position

We are looking for a passionate and driven Engineering Manager, Site Reliability Engineering, to lead our SRE team at Nayya. In this role, you will combine strong technical expertise with people leadership to build a high-performing team that ensures the reliability, scalability, and performance of our platform. This is a hands-on leadership role: you will actively contribute to design, write code, and engage directly in incident response alongside your team. As an Engineering Manager at a fast-paced, growth-stage startup, you will be a key partner to engineering, product, and data leadership, setting the technical direction for infrastructure and operations while developing the people and processes that make it all work. We are seeking a leader who thrives in an environment that prioritizes impatience, excellence, resilience, and courage: someone who is excited about leading teams that make an immediate impact while pushing the boundaries of what’s possible. You will own the roadmap for reliability and infrastructure, drive strategic decisions, and foster a culture of collaboration, continuous improvement, and technical excellence across the organization.

Requirements

7+ years of professional experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or related roles, with at least 2 years in a people management or team lead capacity.
Proven track record of leading and scaling engineering teams in a fast-paced startup or growth-stage environment.
Strong technical background with hands-on experience building and maintaining high-performance, scalable systems.
Proficiency in at least one modern programming language such as Python, Ruby, Go, JavaScript, or similar.
Extensive experience with AWS, specifically VPC networking, Route 53, ECS, EKS, Lambda, API Gateway, and RDS (Postgres/Aurora), as well as data infrastructure provisioning (EMR, Glue, Redshift, Step Functions, Athena).
Deep understanding of infrastructure as code (Terraform preferred) and CI/CD pipelines (GitHub Actions preferred).
Strong knowledge of site reliability practices, incident management, monitoring, and alerting. Familiarity with Datadog or similar observability platforms and tooling is a plus.
People-First Leadership: Genuine commitment to developing engineers, creating inclusive teams, and leading with empathy and accountability.
Agility: Ability to adapt to rapidly changing priorities and shifting technical landscapes while keeping your team focused and effective.
Excellence: Commitment to high standards for reliability, performance, scalability, and team culture.
Courage: Willingness to make difficult decisions, have hard conversations, and step up to complex technical and organizational challenges.
Technical Depth: Staying close to the codebase and infrastructure, leading by example through hands-on contributions.

Nice To Haves

Experience in a growth-stage startup or similar high-growth company.
Familiarity with microservices, serverless architectures, and cloud-native technologies.
Experience partnering with executive leadership on technical strategy and organizational design.

Responsibilities

Build and lead a high-performing SRE team by hiring, onboarding, and retaining top engineering talent.
Provide regular coaching, mentorship, and career development support to direct reports, helping engineers grow into senior technical and leadership roles.
Conduct meaningful performance reviews, set clear goals, and create individual development plans aligned with team and company objectives.
Foster a team culture rooted in ownership, psychological safety, collaboration, and continuous learning.
Define and drive the SRE roadmap in partnership with engineering, product, and data leadership, ensuring alignment with business priorities.
Directly contribute to the design and implementation of highly available systems while guiding the team's technical approach.
Establish and evolve standards for infrastructure as code, observability, CI/CD, incident management, and performance tuning.
Partner with software engineering teams to embed reliability practices into the software development lifecycle, including SLIs, SLOs, and error budgets.
Serve as the primary point of contact for SRE across the organization, translating technical reliability concepts into business impact for non-technical stakeholders.
Collaborate with product, software engineering, and data teams to define and implement best practices for reliability, performance, and scalability.
Represent the SRE team in planning and prioritization discussions, advocating for infrastructure investments and AI enablement.
Own and continuously improve incident management processes, including on-call rotations, escalation procedures, and blameless postmortems.
Balance rapid delivery with system stability, ensuring reliable deployment pipelines and minimal downtime.
Drive a data-informed approach to reliability by establishing and tracking key metrics, SLIs, and error budgets.
Adapt quickly to evolving business needs and emerging technologies, delivering incremental improvements with a focus on learning and iteration.