Site Reliability Engineer (Must have an active TS/SCI with polygraph)

Aperio Global | McLean, VA
$185 - $225 | Onsite

About The Position

This role supports a critical mission and requires an active U.S. Government security clearance at the TS/SCI level with polygraph. We’re looking for a full-time Observability Engineer (OE) who is driven to understand not just what is happening in complex, cloud-native systems, but why. You’ll be part of a highly collaborative data collection and software development team responsible for ensuring services meet the reliability, performance, and uptime expectations of our customers.

This is an environment where systems evolve quickly and attention to detail matters. You’ll help us stay ahead by keeping a constant pulse on capacity, performance, and cost while continuously improving how we see, understand, and respond to system behavior.

As an OE, you’ll design and build monitoring and observability solutions that give teams deep visibility into operational health. Your work will directly support mission success by enabling faster troubleshooting, stronger system insight, and better customer outcomes across the entire technology stack.

Requirements

  • Active/current TS/SCI with required polygraph
  • Bachelor’s degree in Computer Science or a related field
  • 5+ years of relevant engineering experience
  • Hands-on experience with Kubernetes, Docker, Helm, and CI/CD pipelines (e.g., Jenkins or Concourse)
  • Familiarity with distributed version control systems such as Git
  • Experience working in AWS cloud environments
  • Proven experience implementing monitoring and observability solutions across complex systems and data feeds
  • Proficiency in Python and Java scripting
  • Advanced knowledge of Unix/Linux, with strong command-line comfort
  • Willingness to work onsite full time and participate in on-call rotations
  • A collaborative mindset and a sense of ownership when things go wrong

Nice To Haves

  • Experience with additional cloud providers beyond AWS
  • Familiarity with AWS CloudWatch or other native monitoring tools
  • Experience using Prometheus, Grafana, or similar tools to monitor ETL pipelines, APIs, servers, networks, C2S services, and AI/ML platforms
  • Strong understanding of networking fundamentals
  • Experience with incident and problem management processes
  • Root Cause Analysis (RCA) experience
  • Exposure to ETL workflows and data pipelines
  • Organized, detail-oriented, and comfortable documenting and communicating work
  • Willingness to step into leadership roles during incidents—guiding others and driving issues to resolution

Responsibilities

  • Define and uphold standards for monitoring reliability, availability, performance, and maintainability of sponsor-owned systems
  • Design and architect operational solutions that support both applications and infrastructure
  • Drive service acceptance by introducing new operational processes, monitoring strategies, and automation to reduce risk and repeat issues
  • Partner closely with service and product owners to define key performance indicators (KPIs) and identify meaningful trends
  • Provide deep, hands-on troubleshooting support for production issues
  • Work with service owners to quickly identify root causes and restore services during performance or availability incidents
  • Build or leverage tools that correlate data across multiple systems to accelerate root-cause analysis
  • Coordinate with the sponsor during major incidents, large-scale deployments, and SecOps user support activities

Benefits

  • Health Care Plan (Medical, Dental & Vision)
  • Retirement Plan (401k, IRA)
  • Life Insurance (Basic, Voluntary & AD&D)
  • Paid Time Off (Vacation, Sick & Public Holidays)
  • Short Term & Long Term Disability
  • (and much more)