Site Reliability Engineer

JLABHCM20•Newport News, VA

Hybrid

About The Position

At Jefferson Lab, you’ll champion cutting-edge science and operational excellence while shaping the future of discovery. Join us and make your mark where excellence meets purpose and great minds truly matter.

You embed within the HPDF architecture team to make reliability, resilience, and observability first-class features of the facility's scientific data lifecycle systems rather than afterthoughts. You define the initial Service Level Objectives (SLOs) and Service Level Indicators (SLIs), establish monitoring and alerting foundations, influence technology selections across compute, storage, and networking, and build the automation tooling that eliminates manual operations risk. When the facility transitions to operations, you lead the HPDF SRE team, owning availability metrics, incident response, and the continuous improvement processes that keep the facility performing to its design parameters.

Requirements

10 or more years of experience in SRE (Site Reliability Engineering), DevOps, or Systems Engineering roles
Bachelor's degree in Computer Science or a related field
Deep experience with and understanding of distributed systems principles, failure modes, consensus protocols, and self-healing architectures.
Expertise in defining and implementing SLOs, SLIs, and comprehensive monitoring stacks, plus experience architecting observability frameworks in greenfield environments (e.g., Prometheus, ELK, OpenTelemetry).
Strong scripting and automation skills (Go, Python, Shell).

Nice To Haves

Master's degree in Computer Science or a related field
Deep experience with public cloud environments (AWS, Azure, GCP) and container orchestration (Kubernetes).
Experience with configuration management and IaC tools (e.g., Terraform, Puppet, Ansible).
Experience with IPv4 and IPv6 networking, high-speed interconnects, and data transfer protocols; familiarity with network reliability patterns and software-defined networking (preferred)
Experience with HPC infrastructure and environments (preferred)
Experience leading or mentoring small teams (preferred)

Responsibilities

Work closely with the rest of the architecture team to review and influence technology choices and to establish reliability and resilience parameters (e.g., expected availability, failure domain isolation, disaster recovery).
Ensure the selected software and hardware systems meet those parameters, while also meeting performance expectations and security requirements.
Evaluate vendor and open-source solutions against established reliability and resilience parameters, develop comparative assessments, and provide technically grounded recommendations to inform architecture decisions and support acquisitions.
Establish the foundation for system observability by defining initial SLOs/SLIs, then architecting, prototyping, and implementing comprehensive monitoring, logging, and alerting solutions.
Lead the design, prototyping, and implementation of these solutions, including custom automation to eliminate manual operations and further improve facility resilience.
Participate in testing and performance analysis to validate reliability and resilience design decisions and to identify bottlenecks and alternative approaches.
Define the operational framework, on-call structures, incident response, other operational processes, and staffing plans for the future SRE team, bridging the design-to-operations transition.

Benefits

Medical, Dental, and Vision Care Plans
Flexible Spending Accounts
Paid Time-off and Leave Programs (paid parental leave, vacation, holidays, and sick leave)
401(k) Plan – 9% Lab Contribution; 100% vested
Flexible Work Arrangements (Remote & Alternate Work Schedules available)
Tuition Assistance, Training and Professional Development Programs
