Principal Service Reliability Engineer

Prescryptive Health, Inc. • Washington, DC

1d • $150,000 - $205,000

About The Position

Prescryptive is seeking a Principal Service Reliability Engineer to ensure the reliability, scalability, security, and performance of its cloud-based platforms across pre-production and production environments. This senior individual contributor role focuses on technical leadership, architectural influence, and reliability strategy. The engineer will collaborate with engineering, product, and platform teams to build resilient systems and establish SRE best practices. As a healthcare technology company, Prescryptive aims to exceed HIPAA requirements through proactive infrastructure design, automation, monitoring, and operational excellence. The ideal candidate is a technical expert skilled at influencing without authority and driving reliability at scale.

Requirements

Bachelor’s degree in Computer Science, Computer Engineering, Information Security, or equivalent practical experience.
8+ years of experience in SRE, DevOps, or infrastructure engineering roles.
Deep expertise in managing cloud-based infrastructure, preferably in Azure.
Strong experience with Kubernetes, including cluster setup, networking, access control, and authorization; deployments, services, config maps, secrets, and cronjobs; and designing, deploying, and maintaining service mesh infrastructure.
Strong experience with GitHub Actions and CI/CD pipelines.
Experience supporting production environments and high-availability systems.
Familiarity with Agile methodologies (Scrum, sprints, backlogs).
Experience managing certificates, secrets, and monitoring systems.
Strong collaboration skills across a large, evolving engineering organization.
Demonstrated ability to lead complex technical initiatives across multiple teams.
Bias toward action with a focus on continuous improvement.
Experience with cloud infrastructure: load balancers, VNets, DNS, network firewalls, API Management, Service Bus, storage accounts, automation functions, virtual machines, scale sets, Azure Virtual Desktop, Active Directory and domain services, monitoring and alerting systems, and secret and key management.
Experience with platform and data technologies: Kubernetes clusters and container registries, kubectl and container operations, Redis, SQL, PostgreSQL, and MongoDB.

Nice To Haves

Experience operating at a Principal, Staff, or senior technical leadership level.
Experience in healthcare or HIPAA-regulated environments.
Experience designing and scaling SRE practices (SLOs, observability, incident management).
Track record of improving reliability, performance, or operational maturity at scale.
Strong communication and stakeholder management skills.
Demonstrated ability to mentor and influence engineers without direct authority.
Curiosity and willingness to learn and adopt new technologies.

Responsibilities

Define and evolve build and deployment pipelines for secure, reliable production releases.
Lead the design and operation of pre-production and production cloud infrastructure to ensure high availability, performance, and security.
Partner with Engineering teams to improve release flow from development to test and production.
Define and implement monitoring, alerting, and incident response practices.
Lead complex troubleshooting efforts and root cause analysis, and drive systemic post-incident improvements.
Evaluate, recommend, and implement new infrastructure technologies and services.
Ensure platforms meet or exceed healthcare security and compliance standards.
Establish and drive adoption of SRE best practices (SLOs, SLIs, error budgets, reliability engineering standards).
Serve as a technical leader and advisor across Service Reliability, DevOps, and engineering teams.
Mentor engineers through design reviews, knowledge sharing, and guidance on best practices.
Influence system design and architectural decisions to improve scalability and resiliency.
Partner with teams to prioritize reliability initiatives and reduce operational risk.
Contribute to defining engineering standards, best practices, and operational runbooks.
Foster a culture of ownership, accountability, reliability, and continuous improvement.