Senior Site Reliability Engineer

Fabric

$120,000 - $145,000

About The Position

As a Senior Site Reliability Engineer, you will own and evolve the infrastructure powering healthcare experiences for millions of patients. This role bridges the gap between traditional infrastructure excellence and the future of AI-driven operations. You will act as a primary architect for our AWS and Kubernetes (EKS) environment, ensuring the platform is resilient, scalable, and compliant while exploring how agentic workflows can modernize SRE practices.

Requirements

5+ years of experience in SRE, DevOps, or Platform roles managing production environments at scale.
Expert technical depth in AWS (EKS, EC2, RDS, S3) and production-grade Kubernetes management.
Proficiency with modern tooling including Terraform (IaC), Datadog (Observability), and CI/CD systems.
Strong coding and scripting skills in Python, Bash, Ruby, or Go.
A "rigor-first" mindset with a dedication to HIPAA-compliant, high-availability architecture.

Nice To Haves

Experience building agentic workflows or AI-assisted tooling to drive operational efficiency.

Responsibilities

Infrastructure & Kubernetes Orchestration
Designing, deploying, and maintaining production Kubernetes (EKS) clusters to ensure enterprise-grade availability for our users.
Eliminating manual configuration by defining and managing all infrastructure state through Terraform.
Optimizing the AWS footprint (specifically EC2, RDS, and S3) to balance high performance with cost-efficiency and reliability.
AI-Assisted Operations & Automation
Exploring and deploying agentic workflows for AI-assisted runbooks that automate complex operational decisions and repetitive tasks.
Building and evolving deployment pipelines using GitHub Actions or Semaphore to ensure delivery is both rapid and safe.
Focusing on toil reduction by developing internal tools that replace manual operational work with intelligent, autonomous systems.
Observability & Incident Management
Driving the evolution of the observability stack in Datadog by implementing the sophisticated metrics, traces, and logs needed to meet SLOs.
Leading incident response efforts and facilitating blameless postmortems that systematically reduce mean time to recovery (MTTR).
Defining and monitoring the SLIs and SLOs that ensure the platform consistently meets rigorous healthcare performance standards.
Compliance & Collaboration
Ensuring every piece of infrastructure remains fully compliant with HIPAA and other critical healthcare regulatory requirements.
Mentoring engineers across the company on reliability best practices and contributing a clinical-safety perspective to cross-functional design reviews.