Staff DevOps Engineer

webAI • Austin, TX

About The Position

We are seeking a Staff DevOps Engineer to architect, build, and scale secure infrastructure for deploying AI workloads across cloud and edge environments. This is a high-impact, staff-level individual contributor role where you will drive infrastructure strategy, lead technical initiatives, and serve as the subject matter expert on cloud architecture, security best practices, and platform reliability. You will design scalable, automated infrastructure solutions that enable our AI platform to operate efficiently across diverse deployment scenarios—from public cloud to on-premises and edge computing environments. This role requires deep technical expertise, architectural thinking, and the ability to translate complex requirements into production-ready infrastructure automation.

Requirements

7+ years of hands-on experience in DevOps, Site Reliability Engineering, or Infrastructure Engineering, with a proven track record of architecting production systems
Expert-level proficiency with Docker, Kubernetes (CKA/CKAD preferred), and cloud-native technologies in production environments
5+ years implementing Infrastructure as Code with Terraform, Ansible, or Pulumi, managing large-scale environments (50+ cloud resources)
Deep experience with cloud platforms (AWS, Azure, or GCP) including compute, networking, storage, and managed services
Proven experience building and scaling CI/CD pipelines with integrated security controls (GitHub Actions, GitLab CI, Jenkins, ArgoCD)
Strong programming skills in Python (preferred), Bash, or Go for infrastructure tooling and automation
Production experience with observability and monitoring tools: Prometheus, Grafana, ELK, CloudWatch, Datadog, or similar
Experience with MLOps workflows: model deployment automation, versioning, and lifecycle management
Demonstrated experience with GitOps methodologies and declarative infrastructure management
Strong understanding of security best practices: encryption, secrets management, identity and access management (IAM), network security
Excellent written and verbal communication skills for technical documentation and cross-functional collaboration

Nice To Haves

Experience architecting multi-cloud or hybrid cloud environments with portability and interoperability considerations
Hands-on experience deploying large language models (LLMs) or transformer models at scale with model serving infrastructure
Expertise in Zero Trust architecture and modern security patterns for cloud-native applications
Experience with service mesh technologies (Istio, Linkerd) for microservices communication and observability
Strong understanding of AI/ML infrastructure: feature stores, model registries, A/B testing infrastructure, and model monitoring
Experience with edge computing deployments and distributed system architectures
Cost optimization expertise: FinOps practices, resource rightsizing, and cloud cost management
Experience mentoring or leading technical initiatives across engineering teams
Certifications: CKA, CKAD, Terraform Associate, AWS Solutions Architect, Azure Administrator, or GCP Professional Cloud Architect

Responsibilities

Architect secure, scalable cloud and edge infrastructure for deploying AI workloads across multi-cloud (AWS, Azure, GCP) and hybrid environments
Build and maintain production-grade Infrastructure as Code (IaC) using Terraform, Ansible, or Pulumi, managing 100+ resources with GitOps workflows and automated validation
Design and operate production Kubernetes clusters optimized for AI/ML workloads with GPU support, implementing container security, multi-tenancy, and resource optimization
Implement secure CI/CD pipelines with integrated security controls (SAST, DAST, vulnerability scanning, secrets management) and automated deployment workflows for containerized AI models
Lead MLOps infrastructure initiatives including model deployment pipelines, versioning, feature stores, experiment tracking, and monitoring for model performance and drift
Design comprehensive observability and monitoring using Prometheus, Grafana, ELK, or Datadog with distributed tracing, APM, and real-time alerting aligned to SLIs/SLOs
Implement security best practices including least-privilege access, encryption at rest/in transit, network segmentation, and automated compliance validation
Lead incident response and reliability initiatives, participate in the on-call rotation, conduct post-mortems, and drive continuous improvement in system reliability
Architect disaster recovery and business continuity strategies with automated backup, failover, and recovery processes
Develop reusable infrastructure modules and templates to accelerate environment provisioning and standardize deployment patterns across teams
Mentor mid-level and senior engineers on cloud architecture, DevOps best practices, and platform reliability through design reviews and technical guidance
Drive technical documentation and knowledge sharing including runbooks, architecture decision records (ADRs), and infrastructure standards