Senior Site Reliability Engineer

Ford
Dearborn, MI
Posted 16h ago

About The Position

Lead the implementation and continuous evolution of Site Reliability Engineering (SRE) practices to ensure high availability, performance, and scalability for the Ford Service Reservation Platform and its applications.

  • Define, implement, and maintain Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for key services, aligning reliability goals with critical business and customer outcomes.
  • Generate regular SLO and error budget reports, and collaborate with engineering teams to prioritize reliability work, incident follow-ups, and targeted technical debt reduction.
  • Lead weekly status and reliability reviews, communicating risks, performance trends, and improvement opportunities to key stakeholders in engineering and product.
  • Champion data-driven decision-making, using observability insights to improve incident response, reduce Mean Time to Resolution (MTTR), and enhance the overall customer experience.

Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, Systems Engineering, or a related field, or an equivalent combination of relevant education and experience.
  • 7+ years of experience in Software Engineering, DevOps, or Systems Administration.
  • 5+ years of dedicated experience in a Site Reliability Engineering (SRE) or Platform Engineering role.
  • 2+ years of experience leading technical initiatives or mentoring junior engineers in an SRE context.
  • GCP Expertise: Deep understanding of Google Cloud Platform services, specifically networking (VPC, Firewalls), Load Balancing, GKE (Google Kubernetes Engine), and IAM.
  • Infrastructure as Code (IaC): Advanced proficiency in Terraform for provisioning cloud resources and managing infrastructure state.
  • Linux/Systems: Strong command of Linux internals and administration.
  • Incident Management: Experience acting as an Incident Commander or leading "Post-Mortem" (Blameless Root Cause Analysis) sessions to prevent recurrence of systemic issues.
  • Data-Driven Mindset: Ability to translate complex observability data into actionable insights for engineering and product stakeholders.
  • Communication: Strong ability to lead weekly reliability reviews and communicate technical risks to non-technical stakeholders.

Nice To Haves

  • Master's degree in Computer Science, Computer Engineering, Systems Engineering, or a related field.
  • Google Professional Cloud Architect or Google Professional Cloud DevOps Engineer certification.
  • Dynatrace Professional Certification.
  • Terraform Associate Certification.
  • Platform Experience: Prior experience working on high-traffic reservation systems, e-commerce platforms, or automotive service applications.

Responsibilities

  • Design and implement robust Google Cloud Platform (GCP) observability patterns for logs, metrics, alerts, and dashboards specifically tailored for the Ford Service Reservation Platform and its associated applications.
  • Develop and deploy infrastructure as code using Terraform to provision and manage GCP resources, including networking, load balancing, and monitoring artifacts.
  • Build reusable, scalable Terraform modules to automate the provisioning of GCP monitoring artifacts, including log-based metrics, alerting policies, uptime checks, and comprehensive dashboards.
  • Develop and maintain robust CI/CD pipelines using Tekton Pipelines as Code (PAC) and/or GitHub Actions for application code deployment, infrastructure changes, and automated operational tasks (e.g., instance management, cache invalidation, and data backups).
  • Manage GitHub repositories for application code, automation scripts, and configuration management.
  • Establish and continually refine Incident Management and Problem Management processes, coordinating effectively with application teams for rapid resolution and thorough root cause analysis of issues.