Site Reliability Engineer

Steampunk•McLean, VA

$125,000 - $200,000•Hybrid

About The Position

As a Site Reliability Engineer (SRE), you will help design, build, and operate reliable, secure, and observable cloud-native systems that support mission-critical applications and services. You will blend software engineering, DevOps practices, and infrastructure expertise to improve system reliability, performance, and operational excellence across the platform.

Requirements

Ability to obtain a U.S. government Security Clearance.
BS Degree in an IT field with 10 years of experience OR BS in a non-IT field and 12 years of related IT experience.
3 years of experience with one or more cloud platforms (e.g., AWS, Azure, or GCP).
3 years of experience with Git SCM providers such as GitHub, GitLab, or Bitbucket.
3 years of experience with at least one programming language (e.g., Python, Go, Java, or JavaScript) for tooling, automation, or application development.
Hands-on experience working with AWS in production environments.
Hands-on experience designing, deploying, and operating Kubernetes-based systems (e.g., EKS, AKS, GKE).
Experience with DevOps practices and tools, including CI/CD pipelines (e.g., GitHub Actions, GitLab CI, Jenkins, Azure DevOps).
Hands-on experience with infrastructure-as-code tools (e.g., Terraform, CloudFormation, Pulumi) to manage cloud resources.
Experience configuring and managing containerization and orchestration platforms.
Experience implementing monitoring, logging, and tracing solutions (e.g., CloudWatch, Prometheus, Grafana, Datadog, New Relic, Elastic, OpenTelemetry).
Familiarity with networking fundamentals (DNS, load balancing, routing, TLS) and their impact on reliability and performance.
Experience with incident management, on-call operations, and production support practices.
Certification(s) such as cloud certifications (e.g., AWS DevOps Engineer, AWS SysOps Administrator, Azure Administrator/DevOps Engineer, GCP Professional Cloud DevOps Engineer) and Kubernetes certifications (e.g., CKA, CKAD).

Nice To Haves

Hands-on experience with Drupal and Azure.
Experience implementing Automated Testing frameworks including Selenium.
Excellent written and verbal communication, interpersonal, and collaboration skills.
Experience documenting the as-is state of an environment, performing a gap analysis, and producing artifacts that articulate options and recommendations.
Experience designing and implementing SLOs, SLIs, and error budgets in production environments.
Experience with chaos engineering, game days, and resilience testing.
Local to Washington, DC metro area and available to be onsite 2 days a week.
NIH experience.

Responsibilities

Establish development tools and infrastructure for automation.
Understand the needs of stakeholders and convey them to developers.
Automate and improve development, testing, deployment, and release processes.
Test and review code written by others and analyze the results.
Own and improve the reliability, availability, and performance of production systems and services.
Define, implement, and maintain Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
Perform capacity planning, scalability analysis, and performance tuning for applications and infrastructure.
Participate in on-call rotations, incident response, and post-incident reviews to drive long-term improvements.
Design and implement infrastructure-as-code (IaC) to provision and manage cloud resources (e.g., AWS, Azure, GCP).
Build and maintain CI/CD pipelines to ensure reliable, repeatable delivery of application and infrastructure changes.
Engineer resilient architectures using concepts such as auto-scaling, blue/green deployments, canary releases, and self-healing patterns.
Collaborate with security and platform teams to ensure infrastructure adheres to compliance, security, and governance requirements.
Collaborate with application development teams to design reliable, observable, and operable services from the outset.
Contribute to application code, tooling, and frameworks that enhance reliability, resilience, and performance.
Act as an individual contributor and mentor more junior team members.
Present regular status updates and provide cross-training to other DevOps team members.