DevOps Engineer (Founding Team)

Fabrion • San Francisco Bay Area, CA

About The Position

We're building an AI-native, multi-tenant enterprise platform for complex domains in industrial verticals. In this architecture, DevOps isn't just about shipping features; it's about operationalizing intelligent agents, ensuring traceability across AI systems, and supporting mission-critical ML infrastructure at scale. We're looking for a DevOps engineer who can own infrastructure from Day 1, automating everything from CI/CD and observability to cloud governance and security. You'll work with a highly technical team building real-time AI pipelines and multi-agent systems. If you want to be the person who makes the platform run fast, secure, reliable, and explainable, this is your role.

Requirements

4–10+ years in DevOps, platform engineering, or SRE in production-grade systems
Strong experience with Docker, Kubernetes (EKS/GKE), Terraform or Pulumi
Hands-on experience deploying and monitoring distributed cloud-native systems
Familiar with GitOps practices, CI/CD design, progressive delivery, and secure SDLC
Clear understanding of how to implement monitoring, alerting, and failure simulation in dynamic environments
Obsessed with reliability, latency, uptime, and repeatability
Security-aware and compliance-conscious
Proactive — you don’t wait for alerts to fix things
Comfortable collaborating with backend, AI, and data teams

Nice To Haves

Experience running LLM orchestration frameworks (e.g. LangChain, LangGraph, Dust, ReAct agents)
Building retrieval-augmented generation (RAG) pipelines — and deploying them safely and repeatably
Familiarity with vector DBs (Weaviate, Qdrant, Pinecone) and embedding pipelines
Monitoring and governing long-running or multi-agent chains
Auditability and replay systems for agent decision-making
Serving fine-tuned or open-source LLMs with model versioning and GPU scaling (e.g. vLLM, TGI)
Interest in auto-remediation using agents (e.g. observability + alert → insight → response via LLM)

Responsibilities

Build and maintain scalable cloud infrastructure across AWS/GCP/Azure with a focus on secure, tenant-isolated deployments
Own and evolve CI/CD systems (e.g. GitHub Actions, ArgoCD) with progressive rollout, testing, and rollback flows
Establish observability tooling across services, agents, and pipelines (OpenTelemetry, Prometheus, Grafana, Sentry)
Implement policy-as-code (OPA, Rego) for deployment safety, RBAC, audit logging, and approval workflows
Define and enforce SLAs, uptime targets (99.99%+), incident response, and remediation workflows
Secure infrastructure: IAM, VPC, encryption, key management, image scanning, secrets rotation
Automate deployments, infrastructure provisioning (Terraform, Helm), and environment replication
