Site Reliability Engineer

Steampunk
McLean, VA
$125,000 - $200,000
Hybrid

About The Position

As a Site Reliability Engineer (SRE), you will help design, build, and operate reliable, secure, and observable cloud-native systems that support mission-critical applications and services. You will blend software engineering, DevOps practices, and infrastructure expertise to improve system reliability, performance, and operational excellence across the platform.

Requirements

  • Ability to obtain a U.S. government Security Clearance.
  • BS degree in an IT field with 10 years of experience, or BS degree in a non-IT field with 12 years of related IT experience.
  • 3 years of experience with one or more cloud platforms (e.g., AWS, Azure, or GCP).
  • 3 years of experience with Git SCM providers such as GitHub, GitLab, or Bitbucket.
  • 3 years of experience with at least one programming language (e.g., Python, Go, Java, or JavaScript) for tooling, automation, or application development.
  • Hands-on experience working with AWS in production environments.
  • Hands-on experience designing, deploying, and operating Kubernetes-based systems (e.g., EKS, AKS, GKE).
  • Experience with DevOps practices and tools, including CI/CD pipelines (e.g., GitHub Actions, GitLab CI, Jenkins, Azure DevOps).
  • Hands-on experience with infrastructure-as-code tools (e.g., Terraform, CloudFormation, Pulumi) to manage cloud resources.
  • Experience configuring and managing containerization and orchestration platforms.
  • Experience implementing monitoring, logging, and tracing solutions (e.g., CloudWatch, Prometheus, Grafana, Datadog, New Relic, Elastic, OpenTelemetry).
  • Familiarity with networking fundamentals (DNS, load balancing, routing, TLS) and their impact on reliability and performance.
  • Experience with incident management, on-call operations, and production support practices.
  • Certification(s) such as cloud certifications (e.g., AWS DevOps Engineer, AWS SysOps Administrator, Azure Administrator/DevOps Engineer, GCP Professional Cloud DevOps Engineer) or Kubernetes certifications (e.g., CKA, CKAD).

Nice To Haves

  • Hands-on experience with Drupal and Azure.
  • Experience implementing automated testing frameworks, including Selenium.
  • Excellent written and verbal communication, interpersonal, and collaborative skills.
  • Experience documenting the as-is state of an environment, performing a gap analysis, and producing artifacts that articulate options and recommendations.
  • Experience designing and implementing SLOs, SLIs, and error budgets in production environments.
  • Experience with chaos engineering, game days, and resilience testing.
  • Local to Washington, DC metro area and available to be onsite 2 days a week.
  • NIH experience.

Responsibilities

  • Establish development tools and infrastructure for automation.
  • Understand the needs of stakeholders and convey them to developers.
  • Automate and improve development, testing, deployment, and release processes.
  • Test and examine code written by others and analyze the results.
  • Own and improve the reliability, availability, and performance of production systems and services.
  • Define, implement, and maintain Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
  • Perform capacity planning, scalability analysis, and performance tuning for applications and infrastructure.
  • Participate in on-call rotations, incident response, and post-incident reviews to drive long-term improvements.
  • Design and implement infrastructure-as-code (IaC) to provision and manage cloud resources (e.g., AWS, Azure, GCP).
  • Build and maintain CI/CD pipelines to ensure reliable, repeatable delivery of application and infrastructure changes.
  • Engineer resilient architectures using concepts such as auto-scaling, blue/green deployments, canary releases, and self-healing patterns.
  • Collaborate with security and platform teams to ensure infrastructure adheres to compliance, security, and governance requirements.
  • Collaborate with application development teams to design reliable, observable, and operable services from the outset.
  • Contribute to application code, tooling, and frameworks that enhance reliability, resilience, and performance.
  • Act as an individual contributor and mentor more junior team members.
  • Present regular status updates and provide cross-training to other DevOps team members.