Technical Lead - DevOps

Koalafi•Arlington, VA

Onsite

About The Position

We're hiring a Tech Lead for DevOps to own the delivery and operational health of our platform engineering team. This is a hands-on leadership role where you will carry significant engineering weight alongside the team while owning the operating rhythm, sprint commitments, and accountability for what ships. You will work in close partnership with the Cloud Architect, who owns platform strategy and technology standards, translating that direction into a clear, well-organized backlog and driving the team to deliver it with quality and predictability. The role acts as the bridge between strategic direction and team execution. The environment is AI-forward, and you will be expected to leverage AI tools in your own work and create conditions for other engineers to do the same effectively and responsibly.

Requirements

7+ years of hands-on experience in cloud infrastructure, DevOps, SRE, or platform engineering
Demonstrated experience as a formal people manager or tech lead with direct reports, including performance management, career development, and building team capability
Demonstrated experience leading technical delivery: owning timelines, driving sprint execution, and being accountable for what a team ships
Ability to translate high-level technical direction into well-scoped, executable team work
Comfort partnering with senior technical peers (Cloud Architect, VP) and sound judgment on when to consult vs. decide independently
Strong hands-on experience with Terraform in production (modules, patterns, environment strategy, state management)
Strong hands-on experience operating Kubernetes in production (EKS strongly preferred)
Strong AWS fundamentals: practical experience with compute, networking, IAM, and production operations
Experience building and maintaining CI/CD pipelines (GitLab CI preferred; GitHub Actions transferable)
Strong observability fundamentals, including metrics, logging, distributed tracing, SLO/SLI design, and alerting strategy, with experience evaluating and evolving observability practices at a platform level
Experience building automation using Bash and a general-purpose language (Go or Python)
Strong troubleshooting skills: you drive root cause analysis and implement long-term fixes
Hands-on experience using AI coding tools (e.g., GitHub Copilot, Cursor, Claude) as a productivity multiplier in production engineering work
This position requires regular in-person attendance at one of our two office locations (Richmond, VA or Arlington, VA). Candidates must already be located within a commutable distance to either location, as relocation assistance is not available at this time.

Nice To Haves

Experience with Istio or other service mesh technologies
Experience operating relational databases in AWS (RDS PostgreSQL/Aurora/MS SQL)
Experience with AWS Lambda or serverless architectures
Experience improving reliability for distributed systems at scale
Prior experience as a technical anchor or team lead in a platform or infrastructure context
Experience building or operating infrastructure that supports AI/ML workloads (compute, storage, serving patterns) in AWS

Responsibilities

Set and own team priorities: determine what the team works on, in what order, and why; translate VP direction and Cloud Architect input into a clear, executable backlog
Own sprint commitments and team capacity planning: you are accountable for what the team commits to and whether it ships
Surface risks early and communicate delivery status accurately to the VP
Run sprint ceremonies: planning, stand-ups, retros, and demos
Maintain Jira hygiene: tickets are well-defined, updated, and always reflect actual state
Identify and resolve blockers before they slow the team down
Communicate cross-team dependencies early and proactively
Be a strong technical contributor: carry significant engineering weight alongside the team and actively deliver on high-impact work
Own day-to-day technical decisions within the team's scope
Translate architectural direction into sprint-level tasks the team can act on
Build and evolve CI/CD pipelines and delivery automation, ensuring deployment safety, consistency, and velocity
Improve observability and operational readiness across metrics, logging, distributed tracing, and alerting (Prometheus, Grafana, Dynatrace, Elasticsearch), including actionable dashboards and SLO-based alerting
Design and implement automation and self-service workflows using infrastructure-as-code, APIs, and developer platforms to reduce developer friction
Implement secure delivery practices with policy-driven pipeline controls
Contribute to infrastructure in Terraform, working within established architectural patterns and standards
Support and improve secrets management patterns across runtime and CI/CD workflows
Champion AI-assisted development practices across the team: prompt engineering workflows, AI-powered code review, and tooling integrations (e.g., GitHub Copilot, Cursor, or equivalent) as first-class parts of the engineering workflow
Own incident response coordination: drive the process, communicate status, and ensure issues reach the right people
Participate in the on-call rotation and help drive improvements that reduce incidents and alert noise over time
Build and maintain operational runbooks, escalation paths, and documentation for team-owned systems
Drive production readiness as a continuous standard, not a one-time checklist
Manage a team of engineers: this is a formal people manager role with full accountability for the team's performance and growth
Own performance management: regular 1:1s, performance reviews, and direct, constructive feedback
Own career development: growth planning, identifying opportunities, and building engineer capability over time
Mentor engineers through code reviews, pairing, and delivery coaching
Build a team culture that is organized, reliable, and focused on impact
Manage team working norms, address blockers, and partner with the VP on people concerns that require escalation