Technical Lead - DevOps

Koalafi
Arlington, VA
Onsite

About The Position

We're hiring a Tech Lead for DevOps to own the delivery and operational health of our platform engineering team. This is a hands-on leadership role where you will carry significant engineering weight alongside the team while owning the operating rhythm, sprint commitments, and accountability for what ships. You will work in close partnership with the Cloud Architect, who owns platform strategy and technology standards, translating that direction into a clear, well-organized backlog and driving the team to deliver it with quality and predictability. The role acts as the bridge between strategic direction and team execution. The environment is AI-forward, and you will be expected to leverage AI tools in your own work and create conditions for other engineers to do the same effectively and responsibly.

Requirements

  • 7+ years of hands-on experience in cloud infrastructure, DevOps, SRE, or platform engineering
  • Demonstrated experience as a formal people manager or tech lead with direct reports, including performance management, career development, and building team capability
  • Demonstrated experience leading technical delivery: owning timelines, driving sprint execution, and being accountable for what a team ships
  • Ability to translate high-level technical direction into well-scoped, executable team work
  • Comfort partnering with senior technical peers (Cloud Architect, VP) and sound judgment on when to consult vs. decide independently
  • Strong hands-on experience with Terraform in production (modules, patterns, environment strategy, state management)
  • Strong hands-on experience operating Kubernetes in production (EKS strongly preferred)
  • Strong AWS fundamentals: practical experience with compute, networking, IAM, and production operations
  • Experience building and maintaining CI/CD pipelines (GitLab CI preferred; GitHub Actions transferable)
  • Strong observability fundamentals including metrics, logging, distributed tracing, SLO/SLI design, and alerting strategy, with experience evaluating and evolving observability practices at a platform level
  • Experience building automation using Bash and a general-purpose language (Go or Python)
  • Strong troubleshooting skills: you drive root cause analysis and implement long-term fixes
  • Hands-on experience using AI coding tools (e.g., GitHub Copilot, Cursor, Claude) as a productivity multiplier in production engineering work
  • This position requires regular in-person attendance at one of our two office locations (Richmond, VA or Arlington, VA). Candidates must already be located within a commutable distance to either location, as relocation assistance is not available at this time.

Nice To Haves

  • Experience with Istio or other service mesh technologies
  • Experience operating relational databases in AWS (RDS PostgreSQL/Aurora/MS SQL)
  • Experience with AWS Lambda or serverless architectures
  • Experience improving reliability for distributed systems at scale
  • Prior experience as a technical anchor or team lead in a platform or infrastructure context
  • Experience building or operating infrastructure that supports AI/ML workloads (compute, storage, serving patterns) in AWS

Responsibilities

  • Set and own team priorities: determine what the team works on, in what order, and why; translate VP direction and Cloud Architect input into a clear, executable backlog
  • Own sprint commitments and team capacity planning: accountable for what the team commits to and whether it ships
  • Surface risks early and communicate delivery status accurately to the VP
  • Run sprint ceremonies: planning, stand-ups, retros, and demos
  • Maintain Jira hygiene: tickets are well-defined, updated, and always reflect actual state
  • Identify and resolve blockers before they slow the team down
  • Communicate cross-team dependencies early and proactively
  • Be a strong technical contributor: carry significant engineering weight alongside the team and actively deliver on high-impact work
  • Own day-to-day technical decisions within the team's scope
  • Translate architectural direction into sprint-level tasks the team can act on
  • Build and evolve CI/CD pipelines and delivery automation, ensuring deployment safety, consistency, and velocity
  • Improve observability and operational readiness across metrics, logging, distributed tracing, and alerting (Prometheus, Grafana, Dynatrace, Elasticsearch), including actionable dashboards and SLO-based alerting
  • Design and implement automation and self-service workflows using infrastructure-as-code, APIs, and developer platforms to reduce developer friction
  • Implement secure delivery practices with policy-driven pipeline controls
  • Contribute to infrastructure in Terraform, working within established architectural patterns and standards
  • Support and improve secrets management patterns across runtime and CI/CD workflows
  • Champion AI-assisted development practices across the team: prompt engineering workflows, AI-powered code review, and tooling integrations (e.g., GitHub Copilot, Cursor, or equivalent) as first-class parts of the engineering workflow
  • Own incident response coordination: drive the process, communicate status, and ensure issues reach the right people
  • Participate in the on-call rotation and help drive improvements that reduce incidents and alert noise over time
  • Build and maintain operational runbooks, escalation paths, and documentation for team-owned systems
  • Drive production readiness as a continuous standard, not a one-time checklist
  • Manage a team of engineers: this is a formal people manager role with full accountability for the team's performance and growth
  • Own performance management: regular 1:1s, performance reviews, and direct, constructive feedback
  • Own career development: growth planning, identifying opportunities, and building engineer capability over time
  • Mentor engineers through code reviews, pairing, and delivery coaching
  • Build a team culture that is organized, reliable, and focused on impact
  • Manage team working norms, address blockers, and partner with the VP on people concerns that require escalation

Benefits

  • Comprehensive medical, dental, and vision coverage
  • 20 PTO days + 11 paid holidays
  • 401(k) retirement with company matching
  • Student Loan & Tuition Reimbursement
  • Commuter assistance
  • Parental leave (maternity + paternity)
  • Inclusion and Associate Engagement Programs