SRE Lead (US Remote)

First Advantage
5d | $120,000 - $150,000 | Remote

About The Position

At First Advantage (Nasdaq: FA), people are at the heart of everything we do, from our customers and partners to our greatest advantage: our team members. Operating with empathy and compassion, First Advantage fosters a global, inclusive workforce devoted to the diverse voices that make up our talent and products. Our team members empower each other to be their authentic selves and treat all with respect, integrity, and fairness. Say hello to a rewarding career, and come join a leading provider of mission-critical background screening solutions to some of the most recognized Fortune 100 and Global 500 brands.

First Advantage is a global leader in background screening, identity, and verification solutions. As we continue to scale our digital platforms and modern cloud-native infrastructure, we are seeking a highly skilled and forward-thinking Lead Site Reliability Engineer (SRE) to drive reliability, resilience, and operational excellence across our systems.

The Lead SRE will be responsible for guiding reliability strategy, overseeing complex incident response, improving observability, strengthening automation and CI/CD practices, and partnering closely with engineering teams to embed SRE principles throughout the organization. This role requires a deep understanding of modern cloud architecture, including both Azure and AWS, as well as expertise in Linux systems, monitoring technologies, and root-cause analysis.

This is a senior, hands-on engineering role, ideal for someone who enjoys solving difficult problems at scale and mentoring others while driving meaningful improvements to uptime, performance, and customer experience.

Requirements

  • 7+ years of experience in SRE, DevOps, Platform Engineering, or Cloud Engineering.
  • Strong expertise in Azure and AWS.
  • Proficiency in CI/CD, automation, and release engineering.
  • Deep monitoring, logging, and observability experience.
  • Incident response leadership experience.
  • Proven RCA experience.
  • Strong Linux skills.
  • Scripting skills (Python, Bash, PowerShell, Go).
  • IaC experience.
  • Strong systems and networking fundamentals.

Nice To Haves

  • Experience with large-scale distributed systems.
  • Message queues or event streaming knowledge.
  • Familiarity with incident management frameworks.
  • Multi-cloud enterprise experience.
  • Kubernetes, ECS, AKS, or EKS exposure.

Responsibilities

  • Lead reliability initiatives across multiple high-availability, large-scale SaaS systems, ensuring platform uptime, performance, and resilience.
  • Build and maintain distributed systems, infrastructure components, and automation tooling to ensure consistent, reliable delivery of production services.
  • Champion proactive reliability engineering, holistic system monitoring, and continuous operational improvements.
  • Partner with architecture, engineering, and operations teams to define SLAs, SLOs, and SLIs.
  • Architect, build, and maintain cloud infrastructure using best practices.
  • Guide cloud migrations, cost optimization, and resilience engineering across multi-cloud environments.
  • Implement and enforce cloud security, compliance, and governance standards.
  • Create and maintain CI/CD pipelines using GitHub Actions, Azure DevOps, Jenkins, or equivalent.
  • Automate deployments using IaC tools (Terraform, Bicep, CloudFormation).
  • Reduce manual operational burden through automation and self-service tooling.
  • Implement observability stacks covering metrics, logs, traces, and synthetic checks.
  • Standardize monitoring practices using industry tooling.
  • Perform performance analysis, load testing, and optimization.
  • Serve as Incident Commander for major production incidents.
  • Define and improve incident management processes.
  • Ensure clear communication during outages and lead technical bridges.
  • Deliver high‑quality RCAs with actionable follow‑ups.
  • Drive deep, data‑driven RCAs and long-term reliability improvements.
  • Identify and eliminate systemic issues and operational toil.
  • Provide technical leadership across teams.
  • Mentor engineers and promote SRE best practices.
  • Foster strong cross‑functional partnerships.