Staff Site Reliability Engineer

Altana • New York, NY

$170,000 - $220,000

About The Position

At Altana, we believe that software that ships must be reliable and efficient. As a Staff Site Reliability Engineer, you will be instrumental in ensuring the availability, performance, and scalability of Altana’s critical production services, with a strong focus on our cloud-native environments and data pipelines. You will apply Google-style SRE principles, embedding reliability into our architecture and operations through automation, proactive monitoring, and a commitment to reducing toil. You will work hands-on with engineering teams, influencing system design for operability and contributing to the development of robust, self-healing infrastructure. This role emphasizes a deep understanding of observability practices to gain comprehensive insights into system behavior, proactive incident prevention, and efficient incident response. Success will be measured by the resilience of our production systems, the effectiveness of our observability stack, and our continuous improvement in operational efficiency and reliability.

Requirements

5+ years of hands-on experience in a Site Reliability Engineering (SRE), DevOps, or equivalent role focusing on production system reliability and operations.
Strong understanding and practical application of Site Reliability Engineering (SRE) principles, including SLOs, error budgets, toil reduction, and blameless culture.
Expertise in designing, implementing, and managing observability platforms for cloud-native environments.
Proficiency in at least one programming/scripting language (e.g., Python, Go) for automation and tool development.
Extensive hands-on experience with cloud platforms (AWS, Azure, or GCP), including their compute, networking, and database services.
Demonstrated experience with containerization technologies (Docker) and container orchestration platforms (Kubernetes).
Experience with Infrastructure as Code (IaC) tools (e.g., Terraform, OpenTofu, CloudFormation) for managing cloud resources.
Proven experience participating in and improving incident management processes for critical systems.
Knowledge of modern software delivery paradigms, including microservices architectures and CI/CD pipelines.
Excellent problem-solving, analytical, and troubleshooting skills in complex distributed systems.
Strong communication and collaboration skills, with the ability to work effectively across engineering teams.
Experience with data engineering concepts, including building or operating reliable data pipelines, data streaming technologies, or managing large-scale data infrastructure.

Responsibilities

Champion and implement SRE principles, including establishing and monitoring Service Level Objectives (SLOs) and error budgets for critical services.
Drive initiatives to improve system reliability, availability, performance, and efficiency.
Design, implement, and maintain advanced monitoring, logging, and tracing solutions for our cloud-native applications and infrastructure.
Develop dashboards, alerts, and runbooks that provide deep insights into system health and behavior.
Identify and automate repetitive operational tasks and manual processes across our production environment.
Develop tools and scripts to enhance system operations, deployment pipelines, and incident response.
Actively participate in the incident response lifecycle, including detection, triage, mitigation, and resolution of production issues.
Lead thorough blameless postmortems to identify root causes and implement preventative measures and lasting improvements.
Collaborate closely with development teams to influence the design of new services, ensuring they are built for operability, reliability, and cost-efficiency.
Proactively identify and address performance bottlenecks and architectural weaknesses.
Participate in a periodic on-call rotation, responding to critical alerts and ensuring rapid resolution of production incidents.
Implement and maintain reliability and observability for critical data pipelines and data infrastructure.

Benefits

Flexible Time Off: Altana operates with a Flexible Time Off (FTO) policy that gives you agency over your own time off.
Parental Leave: 14 weeks of leave for non-birthing, adoptive, and foster parents and up to 26 weeks for birthing parents, all paid at 100% of your base salary.
Health Benefits: Full suite of medical, vision, and dental benefits with generous employer contributions.
Supplemental Benefits: Life, short- and long-term disability, and AD&D insurance coverage at no cost.
401(k) Savings: Guideline 401(k) retirement savings program.
Commuter Benefits: Pre-tax funds for public transit or parking.
Wellness: Free premium subscription to Calm, the #1 app for meditation, sleep, and mindfulness.
Pet Insurance: Keep pets healthy with Wishbone insurance and/or Total Pet vet service and telehealth discount plan.
Employee Assistance Program: Free access to confidential personal support.
Dependent Care FSA: Set aside pre-tax funds for childcare expenses.

What This Job Offers

Job Type: Full-time
Career Level: Senior
Education Level: Bachelor's degree
Number of Employees: 101-250 employees
