We are looking to hire a Site Reliability Engineer to help build and maintain the observability platform across multiple business lines and to establish observability best practices.

What you'll be doing:
Build and improve observability and reliability solutions that help engineering teams operate and support their services with confidence.
Partner with engineering teams to design monitoring, alerting, dashboards, and service health standards early in the software delivery lifecycle.
Write and maintain code, infrastructure definitions, and automation that reduce manual work and improve reliability.
Help engineers instrument services and systems so teams can quickly detect, diagnose, and resolve issues.
Support the adoption and standardization of telemetry patterns across metrics, logs, and traces, including OpenTelemetry-based instrumentation where appropriate.
Improve the reliability of our AWS and Kubernetes environments, including EKS, through durable engineering solutions rather than repetitive operational work.
Participate in incident response and follow-up activities, including troubleshooting, root cause analysis, and the implementation of lasting fixes.
Identify opportunities to reduce toil and improve the developer experience through automation, reusable patterns, and better engineering practices.
Continuously evaluate our tooling, reliability practices, and engineering processes for opportunities to improve.
Job Type
Full-time
Career Level
Senior
Education Level
No Education Listed