Senior Site Reliability Engineer

Medeloop • San Francisco, CA

About The Position

We are seeking a Senior DevOps & Site Reliability Engineer to own the reliability, scalability, performance, and operational excellence of Medeloop’s platform. This role blends deep DevOps engineering (CI/CD pipelines, infrastructure as code, and cloud architecture) with SRE discipline: SLOs, incident management, capacity planning, observability, and a relentless focus on system uptime. You will be the bridge between development and operations, ensuring our clinical research products are always available, performant, and secure for the healthcare organizations that depend on them.

Requirements

Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field.
7+ years of combined experience in DevOps and/or Site Reliability Engineering roles, with at least 2 years in a senior capacity.
Deep proficiency with AWS services.
Deep experience with observability and monitoring platforms such as DataDog, AWS CloudWatch, and Sentry.
Strong experience building and maintaining CI/CD pipelines with GitHub Actions or equivalent tools.
Expertise in infrastructure as code using AWS CDK, CloudFormation, or Terraform.
Hands-on experience with containerization (Docker) and orchestration (Kubernetes).
Proven track record of defining and operating against SLOs/SLIs and managing incident response processes.
Solid understanding of networking (VPCs, subnets, load balancing, DNS), security, and compliance best practices.
Experience with authentication and authorization systems including AWS Cognito, Auth0, OAuth2, and SSO.
Proactive, self-directed mindset with a bias toward action and taking initiative.
Excellent problem-solving skills and the ability to work independently as well as collaboratively across teams.
Strong communication skills—able to explain complex infrastructure decisions clearly to technical and non-technical stakeholders.
Passion for unsolved challenges in healthcare AI, with the ability to thrive in a fast-paced, multidisciplinary environment and wear multiple hats.

Nice To Haves

Multi-cloud experience (AWS, Azure, GCP).
Familiarity with healthcare data standards, compliance, and protocols such as HIPAA, HL7 FHIR, OMOP, and i2b2.
Experience with chaos engineering practices and tools (e.g., AWS Fault Injection Simulator, Gremlin).
Prior experience in a healthcare or life sciences company operating under strict regulatory requirements.
Contributions to open-source infrastructure or SRE tooling.
Relevant certifications such as AWS Solutions Architect, Certified Kubernetes Administrator (CKA), or Google SRE certification.

Responsibilities

Cloud Infrastructure & Architecture
Design, implement, and manage scalable, secure, and highly available cloud infrastructure on AWS, defined as infrastructure as code (IaC) using AWS CDK, CloudFormation, or Terraform, so that all environments are version-controlled and reproducible.
Architect multi-region and disaster recovery strategies that meet healthcare uptime requirements.
Manage containerized workloads using Docker and Kubernetes, optimizing for cost, performance, and resilience.
Site Reliability Engineering
Define, implement, and monitor Service Level Objectives (SLOs) and Service Level Indicators (SLIs) across all production services.
Build and maintain observability stacks (DataDog, AWS CloudWatch, Sentry) covering metrics, logs, traces, and alerting.
Lead incident response: triage, mitigate, and drive blameless post-incident reviews with actionable follow-ups.
Conduct capacity planning and performance engineering to ensure the platform scales ahead of demand.
Champion error budgets and use them to balance feature velocity with system stability.
Identify, assess, and mitigate operational risks by collaborating with engineering and product teams to evaluate impact and likelihood before they become incidents.
Participate in and help structure an on-call rotation, ensuring clear escalation paths and fair distribution of after-hours coverage.
CI/CD & Automation
Build self-service tooling and runbooks that reduce toil and empower development teams to ship independently.
Design and maintain CI/CD pipelines (GitHub Actions) that enable fast, safe, and repeatable deployments.
Automate security scanning (SAST, DAST) within pipelines and collaborate with engineering to remediate findings.
Implement progressive delivery strategies such as canary deployments, blue-green releases, and feature flags.
Write scripts in Python and Bash for automation, troubleshooting, and building reliability tooling.
Track and drive down operational toil, targeting less than 50% of team time spent on repetitive manual work.
Evaluate and manage change risk for production deployments, maintaining change review processes that balance speed with stability.
Security & Compliance
Ensure infrastructure meets healthcare compliance standards (HIPAA, SOC 2) through policy-as-code, encryption, and access controls.
Manage networking security (VPCs, subnets, security groups, WAFs) and identity/authentication systems (AWS Cognito, Auth0, OAuth2, SSO).
Conduct regular security reviews, vulnerability assessments, and patching across the infrastructure estate.
Collaboration & Culture
Partner closely with product and engineering teams to embed reliability thinking into the software development lifecycle.
Develop and maintain comprehensive documentation for infrastructure, runbooks, and operational playbooks.
Mentor junior engineers on DevOps and SRE best practices, fostering a culture of ownership and continuous improvement.
Stay current with advancements in cloud technologies, DevOps tooling, and SRE methodologies.
Own and evolve internal developer platform tooling — including deployment workflows (GitOps/Flux), bug tracking integrations, and developer self-service portals.

Benefits

Ownership from day one: small team, high-trust, no layers between your work and its impact
Technically ambitious: you'll build AI-powered workflows, not just support them
Real-world stakes: your work accelerates drug development, addresses health equity, and improves clinical research for institutions that matter
Strong foundation: Series A, top-tier investors, and a data asset (200M+ patient records) that most companies spend years trying to build
