Senior DevOps Engineer

Cellebrite • Tysons, VA

About The Position

We are building a rapidly scaling GenAI-powered SaaS platform that enables investigators to interact with complex case data through a conversational AI interface. The system uses a retrieval-augmented generation (RAG) architecture and agentic GenAI workflows to deliver advanced AI capabilities in production. We are looking for a Senior DevOps / Cloud Engineer to own our application services, cloud infrastructure, deployment pipelines, and production reliability in this dynamic AI environment. This is a hands-on role focused on serverless architecture, LLM-based systems, and agentic workflows. You will work closely with Engineering and Customer Success to keep the platform reliable, scalable, and cost-efficient.

Requirements

5+ years of experience in DevOps / SRE / Cloud Engineering
Strong hands-on experience with Google Cloud Platform (GCP)
Proven experience with serverless architectures (Cloud Run, Cloud Functions, or similar)
Experience working with BigQuery (querying, performance tuning, troubleshooting)
Experience running and supporting production SaaS applications
Hands-on experience with GenAI / LLM-based applications in production (including RAG systems, model APIs, or similar)
Experience supporting or operating multi-step AI pipelines or agentic workflows
Strong experience with CI/CD pipelines (GitHub Actions or similar)
Solid scripting/programming skills (Python, TypeScript, Bash, or similar)
Experience with observability and monitoring tools (logging, metrics, tracing, alerting)

Nice to Have

Experience optimizing LLM performance, cost, and reliability at scale
Familiarity with vector databases, embeddings, and retrieval systems
Experience with infrastructure as code (Terraform or similar)
Background in secure or regulated environments
Experience in fast-scaling or experimental product environments

Responsibilities

Own and manage application services running on GCP infrastructure, including serverless and managed services
Design and maintain robust CI/CD pipelines for rapid, safe deployments
Operate and optimize GenAI/LLM workloads in production, including RAG pipelines and agentic workflows
Monitor and improve latency, cost, and reliability of AI-driven systems
Troubleshoot complex production issues across application, data, and infrastructure layers
Build and optimize BigQuery-based data workflows, including query performance tuning
Support and debug multi-step AI pipelines and agent orchestration flows
Implement and maintain observability (logging, metrics, tracing, alerting), including for AI pipelines
Collaborate with engineering teams on architecture improvements for evolving GenAI systems
Partner with Customer Success to investigate and resolve customer-impacting issues (minimal direct customer interaction)
Enforce security best practices in a sensitive-data environment