The APM Root Cause Analysis team's mission is to help engineers and SREs respond rapidly and effectively to incidents affecting their production systems. During an incident, one of the first questions a responder asks is, 'What is the change that caused the incident?' That is exactly the question this team aims to answer.

To answer it, the team is building several systems: a platform to ingest relevant changes from across our customer environments (deployments, DB changes, feature flag changes, K8s changes, etc.); a system to process past incidents in our environment and label the faulty changes that led to them, enabling us to build a high-quality evaluation dataset for faulty change detection; a system that uses LLMs, ML, and statistical models to assess whether a specific change is the cause of an incident; and a product experience that surfaces those faulty changes in strategic locations in the product in a way that aids incident response and reduces MTTR.

As a manager, you will play an active role in shaping the roadmap for automated root cause analysis through collaboration with multiple stakeholder teams. You will have a deep and immediate impact in guiding the product through your design and engineering decisions.
Job Type: Full-time
Career Level: Manager
Industry: Publishing Industries
Education Level: Master's degree
Number of Employees: 1,001-5,000 employees