This role offers the opportunity to shape and lead site reliability practices across a large-scale, AI-driven platform, ensuring that systems are resilient, observable, and self-healing. You will collaborate with cross-functional teams, including Product Engineering, Machine Learning, DevOps, and Development Productivity, to influence both technical and operational strategy. The position emphasizes thought leadership, mentoring, and driving organization-wide adoption of reliability best practices. You will design and implement frameworks for distributed tracing, real user monitoring, performance metrics, and automation that minimize downtime. The role requires hands-on technical contributions and alignment of reliability initiatives with business goals, with the aim of improving engineering velocity, operational efficiency, and user experience. In this remote-first environment, you will lead enterprise-wide reliability initiatives and foster a culture of operational excellence.
Job Type
Full-time
Career Level
Mid Level
Education Level
No Education Listed