Site Reliability Lead

The Vanguard Group
Malvern, PA
Hybrid

About The Position

At Vanguard, we don't just have a mission; we're on a mission. To work for the long-term financial wellbeing of our clients. To lead through products and services that transform our clients' lives. To learn and develop our skills as individuals and as a team. From Malvern to Melbourne, our mission drives us forward and inspires us to be our best.

Vanguard has implemented a hybrid working model for the majority of our crew members, designed to capture the benefits of enhanced flexibility while enabling in-person learning, collaboration, and connection. We believe our mission-driven and highly collaborative culture is critical to supporting long-term client outcomes and enriching the employee experience.

Requirements

  • Minimum 8 years of related experience, with at least 5 years in software development.
  • Bachelor’s degree (B.E./B.Tech) in Computer Science or IT, or Bachelor’s in Computer Applications (BCA) from a recognized institution.
  • Expertise in Site Reliability Engineering (SRE), DevOps, and system reliability, ensuring high availability and performance.
  • Strong programming and scripting skills in Python, Go, Bash, or Java, with experience in automating operational tasks.
  • Proficiency in observability and resiliency tools such as Splunk, Honeycomb, Datadog, Prometheus, or Grafana.
  • Hands-on experience with cloud platforms (AWS, Azure, GCP) and containerization/orchestration tools like Kubernetes, Docker, ECS, or Fargate.
  • Solid understanding of automation, Infrastructure-as-Code (IaC), and configuration management using Terraform, Ansible, or CloudFormation.
  • Experience with CI/CD pipelines, deployment automation, and version control tools like GitHub, Bitbucket, Jenkins, or Bamboo.
  • Deep knowledge of incident management, root cause analysis, and post-incident reviews, with a focus on continuous improvement.

Nice To Haves

  • Experience in mobile platform reliability (Android, iOS), including performance monitoring and optimization.

Responsibilities

  • Ensure system reliability, stability and performance by maintaining service-level objectives (SLOs) and minimizing downtime and incidents.
  • Collaborate with internal teams to assess system health, stability and resilience, providing architectural and design recommendations for reliability.
  • Lead incident management and post-incident reviews, diagnosing issues, deploying fixes and implementing preventive measures.
  • Drive automation of operational tasks, including deployments, monitoring, scaling and system recovery, to improve efficiency and reduce manual intervention.
  • Define and track key performance indicators (KPIs) such as availability, latency and error rates to optimize system performance and inform decision-making.
  • Plan and execute chaos engineering experiments to test system resilience and coordinate performance testing for reliability improvements.
  • Ensure alignment between service-level indicators (SLIs) and service-level objectives (SLOs) across the product family.
  • Develop and maintain product-level runbooks for incident response, collaborating with SRE teams to ensure effective recovery processes.
  • Provide leadership in tool selection and best practices for site reliability engineering (SRE), making final decisions on tools, libraries and standards.
  • Work closely with development teams to improve software reliability, scalability and resilience by offering feedback on design and architecture.
  • Lead troubleshooting and triage efforts during user-impacting incidents, ensuring swift resolution and minimal disruption.
  • Participate in special projects and continuous improvement initiatives, supporting long-term reliability and scalability goals.