Lead the implementation and continuous evolution of Site Reliability Engineering (SRE) practices to ensure exceptionally high availability, performance, and scalability for the Ford Service Reservation Platform and its applications.
Define, implement, and rigorously maintain Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for key services, aligning reliability goals directly with critical business and customer outcomes.
Generate regular SLO and error budget reports, collaborating closely with engineering teams to prioritize reliability work, incident follow-ups, and targeted technical debt reduction.
Lead weekly status and reliability reviews, communicating risks, performance trends, and improvement opportunities to key stakeholders in engineering and product.
Champion data-driven decision-making, using observability insights to improve incident response, reduce Mean Time to Resolution (MTTR), and enhance the overall customer experience.

Cloud & Infrastructure (GCP Focus)
GCP Expertise: Deep understanding of Google Cloud Platform services, specifically networking (VPC, Firewalls), Load Balancing, GKE (Google Kubernetes Engine), and IAM.
Infrastructure as Code (IaC): Advanced proficiency in Terraform for provisioning cloud resources and managing infrastructure state.
Linux/Systems: Strong command of Linux internals and administration.

Incident Management: Experience acting as an Incident Commander or leading "Post-Mortem" (Blameless Root Cause Analysis) sessions to prevent recurrence of systemic issues.
Data-Driven Mindset: Ability to translate complex observability data into actionable insights for engineering and product stakeholders.
Communication: Strong ability to lead weekly reliability reviews and communicate technical risks to non-technical stakeholders.
Job Type
Full-time
Career Level
Mid Level