Forward Deployed Site Reliability Engineer

Twenty • Fort Meade, MD

Onsite

About The Position

At Twenty, the mission is to defend democracies in the digital age by developing revolutionary technologies that operate at the intersection of the cyber and electromagnetic domains. The company focuses on delivering game-changing outcomes that directly impact national security, and it approaches challenging missions with pragmatic optimism.

This role is for a Forward Deployed Site Reliability Engineer who will work on-site at a government customer location. The primary responsibility is to ensure the reliability and performance of Twenty's mission-critical platform within a restricted, air-gapped AWS environment. The position involves defining reliability metrics, leading incident response in constrained settings, and acting as the main technical liaison between on-site operations and the engineering team in Arlington. The engineer will collaborate with a DevSecOps engineer to meet government security and compliance standards and with product engineers to provide operational feedback. The role reports directly to the VP of Engineering and suits individuals who work well autonomously in high-stakes environments and are dedicated to keeping complex systems reliable and performant.

Requirements

5+ years of professional experience in site reliability engineering, production operations, or a closely related infrastructure role.
Proven experience defining and tracking SLIs, SLOs, and error budgets in a production environment.
Hands-on experience with Docker, Docker Compose, and AWS (EC2, ECS, RDS, VPCs, security groups) in production deployments.
Solid Linux/Unix systems administration skills; productive in constrained environments where GUI tooling may be limited or unavailable.
Experience with Terraform for infrastructure provisioning and configuration, working within policy guardrails set by DevSecOps (DSO).
Experience with the LGTM observability stack or equivalent (Grafana, Loki, Prometheus/Mimir, and distributed tracing such as Tempo).
Strong incident response experience: you've led responses, written post-mortems and runbooks, and shipped the preventive fix.
Scripting proficiency in Python or Bash for operational automation (familiarity with Go is a plus), and experience with PagerDuty or equivalent on-call tooling.
Experience working in or directly supporting government or defense environments, including air-gapped or enclave deployments.
Must possess and be able to maintain a TS/SCI security clearance with appropriate polygraph.
U.S. citizenship required.
Willingness to travel occasionally for customer engagements and operational support.

Nice To Haves

Experience with NATS or similar pub/sub messaging systems in production.
Background in cyber operations, intelligence systems, or signals environments.
AWS certifications (Solutions Architect, SysOps, or DevOps Engineer).

Responsibilities

Define, track, and report on SLIs and SLOs for platform services running in the customer environment.
Use error budgets to drive reliability conversations with the Arlington engineering team, translating operational data into prioritized engineering work.
Identify and eliminate toil: build automation for repetitive operational tasks within the constraints of the secure environment.
Conduct post-incident reviews, own root cause analysis, and drive durable fixes in partnership with the engineering team.
Own the observability posture for the on-site deployment — dashboards, alerting thresholds, and log pipelines using the LGTM stack (Grafana, Loki, Tempo, Mimir).
Lead incident response on-site: triage, containment, coordination with Arlington, and customer communication.
Maintain and continuously improve runbooks for operational procedures and emergency response protocols.
Serve as the on-call anchor for the customer environment, with clear escalation paths to the engineering team.
Work with the customer deployment team to stand up and update Twenty's platform within the restricted environment.
Manage containerized services (Docker, Docker Compose) across the deployment lifecycle: configuration, updates, and rollbacks.
Apply and validate Terraform-based infrastructure changes within the enclave, in coordination with the DevSecOps engineer, who owns infrastructure-as-code (IaC) policy and guardrails.
Perform capacity planning and flag scaling requirements to the Arlington team before they become incidents.
Serve as the primary technical interface between the government customer and Twenty's engineering team — translating operational requirements, constraints, and issues in both directions.
Represent the operational environment accurately in engineering discussions: what the team in Arlington can't see, you make visible.
Partner with the DevSecOps engineer on compliance, logging, and audit requirements specific to the customer environment.
Provide technical guidance and support to customer stakeholders on system behavior and troubleshooting procedures.