DevOps Engineer (Founding Team)

Fabrion
San Francisco Bay Area, CA

About The Position

We're building an AI-native, multi-tenant enterprise platform for complex domains in industrial verticals. In this architecture, DevOps isn't just about shipping features: it's about operationalizing intelligent agents, ensuring traceability across AI systems, and supporting mission-critical ML infrastructure at scale. We're looking for a DevOps engineer who can own infrastructure from Day 1, automating everything from CI/CD and observability to cloud governance and security. You'll work with a highly technical team building real-time AI pipelines and multi-agent systems. If you want to be the person who makes the platform run fast, secure, reliable, and explainable, this is your role.

Requirements

  • 4–10+ years of DevOps, platform engineering, or SRE experience on production-grade systems
  • Strong experience with Docker, Kubernetes (EKS/GKE), Terraform or Pulumi
  • Hands-on experience deploying and monitoring distributed cloud-native systems
  • Familiar with GitOps practices, CI/CD design, progressive delivery, and secure SDLC
  • Clear understanding of how to implement monitoring, alerting, and failure simulation in dynamic environments
  • Obsessed with reliability, latency, uptime, and repeatability
  • Security-aware and compliance-conscious
  • Proactive — you don’t wait for alerts to fix things
  • Comfortable collaborating with backend, AI, and data teams

Nice To Haves

  • Experience running LLM orchestration frameworks (e.g. LangChain, LangGraph, Dust, ReAct agents)
  • Building retrieval-augmented generation (RAG) pipelines — and deploying them safely and repeatably
  • Familiarity with vector DBs (Weaviate, Qdrant, Pinecone) and embedding pipelines
  • Monitoring and governing long-running or multi-agent chains
  • Auditability and replay systems for agent decision-making
  • Serving fine-tuned or open-source LLMs with model versioning and GPU scaling (e.g. vLLM, TGI)
  • Interest in auto-remediation using agents (e.g. observability + alert → insight → response via LLM)

Responsibilities

  • Build and maintain scalable cloud infrastructure across AWS/GCP/Azure with a focus on secure, tenant-isolated deployments
  • Own and evolve CI/CD systems (e.g. GitHub Actions, ArgoCD) with progressive rollout, testing, and rollback flows
  • Establish observability tooling across services, agents, and pipelines (OpenTelemetry, Prometheus, Grafana, Sentry)
  • Implement policy-as-code (OPA, Rego) for deployment safety, RBAC, audit logging, and approval workflows
  • Define and enforce SLAs, uptime targets (99.99%+), incident response, and remediation workflows
  • Secure infrastructure: IAM, VPC, encryption, key management, image scanning, secrets rotation
  • Automate deployments, infrastructure provisioning (Terraform, Helm), and environment replication