Staff SRE

Pura • Pleasant Grove, UT

About The Position

In this high-impact staff-level role, you will architect, design, and implement enterprise-scale infrastructure solutions supporting Web, Mobile, Backend, and Data engineering teams, while providing technical leadership across cross-functional groups. You will set reliability standards for the organization, build automation for infrastructure provisioning, configuration management, and deployment, and serve as the technical authority for incident management. You will also mentor the broader engineering organization and contribute to the strategic direction of the SRE practice.

Requirements

10+ years of experience as a Site Reliability Engineer or in a similar role, with a proven track record of architecting solutions for large-scale distributed systems.
Expert-level proficiency in multiple programming languages such as Python, Go, or JavaScript (Node.js), with demonstrated experience building complex automation frameworks and infrastructure tools.
Comprehensive mastery of cloud technologies, particularly AWS and GCP, including experience architecting multi-region, highly available systems.
Deep expertise in Kubernetes administration and architecture, including experience operating large-scale clusters, implementing custom controllers, and optimizing cluster performance.
Extensive experience with advanced observability platforms and practices, including implementing custom monitoring solutions and developing sophisticated alerting strategies.
Proven track record of designing and implementing complex IAM architectures for enterprise-scale organizations.
Distinguished expertise in Infrastructure as Code, particularly with Terraform, including experience developing custom providers and managing multi-cloud deployments.
Exceptional problem-solving abilities with demonstrated experience resolving critical production issues in complex, high-stakes environments.

Responsibilities

Architect, design, and implement enterprise-scale infrastructure solutions supporting Web, Mobile, Backend, and Data engineering teams, while providing technical leadership across cross-functional groups.
Define and drive adoption of reliability standards, architectural patterns, and engineering best practices across the organization, working closely with engineering and security leadership.
Lead performance optimization initiatives, implementing sophisticated monitoring strategies and leveraging advanced analytics to ensure exceptional system reliability and performance at scale.
Design and implement comprehensive automation frameworks for infrastructure provisioning, configuration management, and deployment processes, focusing on efficiency and scalability.
Serve as the technical authority for incident management, establishing robust incident response frameworks, leading cross-functional response efforts, and driving systematic improvements through detailed post-incident analysis.
Architect and implement enterprise-wide incident response strategies, including sophisticated playbooks and multi-tier escalation procedures aligned with business continuity requirements.
Partner with engineering leadership to drive reliability improvements through advanced automated testing frameworks, fault-tolerant architectures, and comprehensive disaster recovery strategies.
Provide technical mentorship and leadership to the broader engineering organization while contributing to the strategic direction of the SRE practice.