About The Position

Onebrief is collaboration and AI-powered workflow software designed specifically for military staffs. By transforming staff work, Onebrief makes the staff as a whole superhuman - meaning faster, smarter, and more efficient. We take ownership, seek excellence, and play to win with the seriousness and camaraderie of an Olympic team. Onebrief operates as an all-remote company, though many of our employees work alongside our customers at military commands around the world.

Founded in 2019 by a group of experienced planners, Onebrief today has a team that spans veterans from all forces and global organizations, and technologists from leading-edge software companies. We’ve raised $123M+ from top-tier investors, including Battery Ventures, General Catalyst, Insight Partners, and Human Capital, and Onebrief is now valued at $1.1B. With this continued growth, Onebrief is able to make an impact where it matters most.

This role requires regularly working on-site at customer locations in Colorado Springs, Colorado. If you are not currently within commuting distance, you must be willing to relocate (Onebrief will provide relocation assistance). An active Top Secret clearance is required; SCI eligibility is a plus.

Requirements

  • 3 years of experience in Site Reliability Engineering or a related field, with firsthand experience managing mission-critical systems within DoD’s air-gapped environments.
  • An active Top Secret security clearance. U.S. citizenship required.
  • Experience automating software delivery and deployment, and providing documentation and self-service tools for engineering teams and customers.
  • A strong understanding of Linux, containerization and orchestration, and virtual machines.
  • Experience with centralized logging, metrics, and observability using tools such as Prometheus, Loki, Grafana, ELK stack, or Datadog.
  • Networking fundamentals: core protocols and secure configurations.
  • A deep understanding of incident response processes, with experience conducting thorough root cause analyses and driving continuous improvement.
  • Clear, concise writing; strong documentation habits and async communication.
  • Core skills and technologies: VMware, Kubernetes, Docker, Helm, Ansible, Terraform, Linux, AWS, DoD compliance, and monitoring and observability tools.

Nice To Haves

  • Experience with compliance frameworks (RMF, STIGs/SRGs, ICD 503).
  • Security‑minded design for air-gapped environments.
  • Active Security+ or another DoD 8570.01-approved security credential, or the ability to obtain a valid credential within 3 months of employment.

Responsibilities

  • Build a World-Class Observability Platform: Design, implement, and manage our monitoring, logging, and alerting stack (e.g., Prometheus, Loki, Alloy, and Grafana).
  • Define, measure, and own alerting that feeds into our Service Level Objectives (SLOs) and increases trust internally and externally.
  • Act as an incident responder, and potentially as incident commander, during critical incidents.
  • Lead blameless post-mortems / After Action Reviews (AARs) that identify true root causes and drive automated, long-term solutions to prevent recurrence.
  • Partner with platform engineers to design, build, and manage secure, resilient Kubernetes clusters and cloud/on-prem environments using Infrastructure-as-Code (Terraform, Ansible).
  • Proactively identify and eliminate operational toil by building automation.