DevOps Engineer

Menlo, San Francisco, CA

About The Position

As a DevOps Engineer, you will own and evolve the platform that everything at Menlo runs on -- from inference serving, to training rigs, to the agentic coding infrastructure that powers day-to-day engineering. You will work deep in the stack across Kubernetes, networking, and, where it matters, bare metal, and help set the technical direction for how Menlo Cloud scales.

Requirements

  • Kubernetes -- deep, hands-on. Strong production experience with Kubernetes, fluent in workloads and controllers, networking (Services, Ingress, CNI basics), storage (PV/PVC, CSI), RBAC, and the autoscaling story end-to-end (HPA, VPA, Cluster Autoscaler, KEDA). Cloud-managed Kubernetes (GKE, EKS, AKS) is fine; on-premises / self-managed Kubernetes (kubeadm, Cluster API, k3s, etc.) is a strong plus.
  • Networking -- design-level, not just operator-level. You have designed real network topologies at some point in your career -- hub-and-spoke, multi-AZ / multi-VPC, or an equivalent enterprise pattern -- and can defend the tradeoffs. Comfortable with VPCs, firewalls, load balancers, private cluster architecture, DNS, and routing. On-premises networking experience (VLANs, BGP, L2/L3 fabrics, pfSense / Fortinet / Palo Alto / Cisco) is a strong plus.
  • CI/CD and Docker -- concepts over tooling. You can build and optimize Dockerfiles (multi-stage builds, layer caching, small/secure base images) and have owned full CI/CD pipelines end-to-end. Tooling is flexible -- GitHub Actions, GitLab CI, Azure Pipelines, Jenkins, Argo Workflows, etc. -- but you should be able to clearly articulate the full lifecycle of a typical pipeline, and explain how CI/CD changes when the deployment target is Kubernetes (ArgoCD / FluxCD, GitOps patterns, progressive delivery).
  • Observability -- you have built this before. You have stood up a full observability stack from scratch and operated it in production -- metrics, logs, traces, alerting, on-call. Familiarity with the Grafana stack (Grafana, Mimir, Tempo, Loki, Pyroscope, OnCall, Prometheus) is a strong plus. Bonus points if you have experimented with agent-assisted SRE workflows or LLM-driven incident triage.
  • SSO and identity. When you bring a new tool into the platform, your instinct is to wire it into a central IdP rather than leave it on local accounts. Comfortable with OpenID Connect, SAML, and traditional directory services (LDAP / Active Directory), and you have integrated tools with an IdP like Keycloak, Okta, Azure AD, or equivalent.
  • Linux and automation fundamentals. Strong Linux proficiency (RHEL/Ubuntu or equivalent) including basic performance and networking debugging. Comfort with infrastructure-as-code (Terraform / Terragrunt / Pulumi or equivalent) and configuration management.
  • Ownership mindset. Comfortable operating in a high-ownership environment where you make architecture decisions, push them to production, and own the outcomes.

Nice To Haves

  • Hands-on experience operating any of Kafka, Redis, PostgreSQL, or OpenSearch at production scale, including HA, backup/restore, and upgrade planning.
  • Experience with OpenStack in production: Nova, Neutron, Cinder, Trove, Horizon, and CLI administration.
  • Experience with KVM virtualization and storage backends like Ceph or Rook-Ceph on Kubernetes.
  • Familiarity with vLLM internals: PagedAttention, continuous batching, tensor parallelism.
  • Background in AI/ML infrastructure or GPU cluster operations at scale.
  • Experience running KEDA or event-driven autoscaling patterns in production.
  • Prior open-source contributions to Kubernetes, OpenStack, or adjacent projects.
  • Kernel-level Linux debugging and performance tuning.

Responsibilities

  • Operate and evolve our Kubernetes platform across multiple clusters and environments (Prod, Dev, hybrid on-prem and public cloud), covering control plane operations, node lifecycle, upgrades, and autoscaling at every layer (Cluster Autoscaler, HPA, KEDA).
  • Architect and manage hybrid cloud infrastructure spanning on-premises and public clouds (GCP, AWS), including workload placement, cross-cloud networking, and unified resource management.
  • Own the CI/CD and GitOps experience end-to-end: container build pipelines, image optimization, and progressive delivery via ArgoCD / FluxCD.
  • Own the observability stack as a single pane of glass across all clusters: Grafana, Mimir, Tempo, Loki, Pyroscope, OnCall, Prometheus -- and help push toward agent-assisted SRE workflows.
  • Manage and improve our inference platform: vLLM serving and AIBrix for multi-model orchestration and autoscaling across a fleet of NVIDIA GPUs.
  • Operate platform services: Kafka, Redis, PostgreSQL, OpenSearch.
  • Manage identity and access via Keycloak integrated with Google Workspace; harden SSO, RBAC, and secrets management across the platform.
  • Harden network security across private load balancers, firewalls, and VPC segmentation; design and maintain hub-and-spoke / multi-AZ topologies.
  • Support training infrastructure: self-service VM provisioning, RunPod burst capacity, Weights & Biases integration.
  • Drive infrastructure reliability, cost efficiency, and capacity planning as the platform scales.

Benefits

  • Menlo Cloud is a first-class investment built from the ground up, and it sits at the center of everything we do, from coding agents to humanoid robots.
  • You will have genuine ownership over a platform that is technically ambitious, cost-conscious by design, and critical to the mission.
  • If you want to build infrastructure that actually matters and have the autonomy to do it right, this is the place.
© 2026 Teal Labs, Inc