Director of SRE (FTE)

IntusCare

1d•Remote

About The Position

As Director of Site Reliability Engineering, you will own the reliability, availability, and operational excellence of Intus Care's entire product and platform portfolio. You will lead a blended organization that includes a managed SRE services team, an internal QA team, and a growing internal SRE capability — with full accountability for outcomes across all three. You will define the SRE strategy and roadmap, establish SLA and SLO frameworks across all products, design and own Intus Care's incident management process, and build the observability and release gate infrastructure that engineering teams depend on to ship safely and confidently. This role requires a leader who combines deep SRE and infrastructure expertise with strong operational management skills and the ability to influence engineering culture across multiple teams. You will partner closely with Engineering, Security, and Product leadership, and serve as the primary voice of reliability in cross-functional forums. This is a rare opportunity to build a reliability function from the ground up in a high-stakes healthcare environment where the work directly impacts patient care.

Requirements

12+ years of SRE, infrastructure, or platform engineering experience, with 5+ years of engineering leadership roles.
Proven track record owning site reliability for complex, multi-tenant SaaS platforms with demanding availability requirements.
Demonstrated experience defining SLA and SLO frameworks, error budgets, and incident management processes at scale.
Experience managing vendor relationships for managed infrastructure or SRE services, including SLA governance and performance management.
Track record leading QA or quality engineering functions, including test automation maturity and release gate ownership.
Strong communication and cross-functional influence skills — able to represent reliability to both technical and non-technical audiences
Strong hands-on experience with cloud infrastructure, preferably Microsoft Azure, including AKS, networking, storage, IAM, and security services.
Deep expertise in Kubernetes, containerized workloads, and production-scale distributed systems.
Experience building and managing CI/CD pipelines using GitHub Actions, ArgoCD, Terraform, or similar DevOps tooling.
Strong background in monitoring, logging, tracing, and observability platforms such as Grafana, Prometheus, Datadog, Splunk, or equivalent.
Experience with scripting and automation using Python, Bash, PowerShell, or similar languages.
Strong understanding of release engineering, automated testing frameworks, QA tooling, and shift-left quality practices.
Experience supporting SaaS applications with uptime, scalability, and security requirements in regulated industries such as healthcare.
Knowledge of HIPAA, SOC2, vulnerability management, access controls, and infrastructure security best practices.
Familiarity with databases, APIs, networking, and troubleshooting across modern web application stacks.

Nice To Haves

Exposure to AI-powered DevOps / AIOps tooling for incident management, automation, and engineering productivity is a plus.
Experience in healthcare technology, HIPAA-compliant environments, or other highly regulated SaaS industries.
Familiarity with FHIR-native or EMR/EHR platform architectures and their specific reliability requirements.
Experience implementing AI-assisted SRE automation including runbook generation, anomaly detection, or incident triage tooling.
Background working with Playwright or equivalent test automation frameworks in a QA leadership capacity.
Experience building internal SRE capability alongside a managed services provider

Responsibilities

Own and execute the SRE strategy and multi-quarter roadmap across reliability, observability, incident management, QA maturity, and release engineering.
Define, measure, and continuously improve SLAs, SLOs, error budgets, uptime, performance, and operational health metrics across all products and services.
Lead production reliability for the full platform, including monitoring, alerting, on-call operations, incident response, root cause analysis, and MTTR reduction.
Establish release readiness standards, deployment safety controls, and quality gates to ensure stable and predictable product releases.
Manage external SRE vendors and partners, including service delivery, SLA governance, escalations, performance reviews, and compliance expectations.
Lead QA engineering strategy with a focus on automation, regression prevention, test coverage, and reducing escaped defects in production.
Partner with Security and Engineering leaders to ensure cloud infrastructure, CI/CD pipelines, and operational tooling meet HIPAA, SOC2, and internal security standards.
Oversee core platform operations including Azure AKS environments, Kubernetes, GitOps workflows, CI/CD pipelines, GitHub Actions, secrets management, access controls, and audit readiness.
Drive observability maturity using tools such as Grafana, Prometheus, logging platforms, tracing tools, and automated alerting frameworks.
Collaborate with Product, Platform, and Engineering teams to embed reliability and quality best practices throughout the software development lifecycle.
Build, mentor, and scale high-performing SRE and QA teams while fostering a culture of ownership, accountability, learning, and continuous improvement.
Drive adoption of AI-enabled automation and intelligent tooling to reduce manual toil, improve productivity, and strengthen operational excellence.