Staff Machine Learning Systems Engineer (MLOps)

Hims & Hers
$210,000 - $250,000
Remote

About The Position

Hims & Hers is seeking a Staff ML Systems Engineer to design, build, and operate the production infrastructure for AI across the company. This is a hands-on role focused on the underlying systems of AI, including Kubernetes, CI/CD, infrastructure-as-code, inference/model-serving, and observability. The engineer will own critical systems like EKS clusters, deployment infrastructure, IAM, secrets management, LLM tracing/observability pipelines, and the developer platform. The role involves partnering with ML engineers, product engineers, and clinical teams to ensure AI systems are reliable, observable, secure, and trustworthy in a regulated healthcare environment. This position is ideal for individuals with a systems and infrastructure mindset, a focus on reliability, security, and cost, and a desire to shape AI production environments where they directly impact patient outcomes.

Requirements

  • 8+ years of professional experience in infrastructure, platform, DevOps, or SRE engineering
  • At least 3 years focused on ML/AI systems in production
  • Deep, hands-on experience with Kubernetes (ideally EKS) and the cloud-native ecosystem (autoscaling, GitOps, Helm/Kustomize, operating clusters at scale, process/job orchestration)
  • Strong infrastructure-as-code skills (Terraform) and experience designing secure cloud architectures (IAM, OIDC, secrets management, least-privilege access)
  • Strong proficiency in Python, with experience building production infrastructure tooling, CLIs, and data/observability pipelines
  • 2+ years of experience operating LLM-based systems in production (LLMOps), including inference routing, serving, tracing, and reliability patterns
  • Hands-on experience with observability/tracing stacks (Datadog, OpenTelemetry, Langfuse, or equivalent) and metrics/log/trace pipelines
  • Experience designing and maintaining CI/CD pipelines, build systems, and developer tooling
  • A systems-and-operations mindset (failure modes, SLOs, observability, security, maintainability)
  • Experience writing and leading technical design documents (TDDs/RFCs) for infrastructure-scale initiatives
  • Strong collaboration skills across engineering, ML, product, security, and clinical teams
  • A deep appreciation for safety, privacy, and security
  • Ideally, experience in a regulated domain such as healthcare, fintech, or life sciences

Nice To Haves

  • Experience with AWS (EKS, Bedrock, S3, CloudFront, IAM) and multi-cloud (GCP/Vertex AI) inference routing
  • Experience with Databricks (MLflow, Unity Catalog, Spark, Delta) and data platform access governance
  • Experience provisioning LLM observability infrastructure (Langfuse, ClickHouse, OpenTelemetry/OTLP tracing, LogFire) and LLM behavior monitoring
  • Experience with Karpenter, cluster autoscaling, and cost optimization for ML compute
  • Experience with monorepo build systems (Pants, Bazel) and large-scale CI/CD
  • Experience building automated PR-review / convention-enforcement pipelines and developer-workflow standards
  • Familiarity with Vertex AI Agent Builder, Vertex AI Model Registry, or GCP managed AI/ML services
  • Contributions to open-source infrastructure, IaC modules, SDKs, or developer tooling projects

Responsibilities

  • Own and scale the AI compute and deployment platform, including Kubernetes cluster operations, autoscaling, storage, and workload isolation.
  • Build and maintain GitOps-based deployment pipelines to ship AI services safely and repeatably.
  • Design ephemeral/preview environments, feature-branched deployments, and nightly release pipelines for validation.
  • Drive efficiency and cost management across compute, autoscaling, and inference infrastructure.
  • Build and scale inference infrastructure and a multi-provider LLM gateway, managing credentials, rate limits, and failover.
  • Build reliable serving patterns for LLM-powered workflows.
  • Create reusable infrastructure abstractions and contracts for AI service deployment and consumption.
  • Own the LLM/AI observability and tracing stack, including provisioning and scaling systems like Langfuse, Datadog, and OpenTelemetry.
  • Build analytics and monitoring pipelines for latency, error, quality, and regression signals.
  • Define SLOs, alerting, on-call runbooks, and incident response for AI infrastructure, leading troubleshooting and improving platform reliability.
  • Own and improve the monorepo build system and CI/CD pipelines for AI workloads.
  • Own shared infrastructure tooling, CLIs, and IaC modules.
  • Identify and eliminate platform bottlenecks to improve developer velocity.
  • Build IAM, OIDC, and secrets management as first-class infrastructure.
  • Encode security-by-default, scope boundaries, and access controls into the platform for HIPAA compliance.
  • Partner with clinical, legal, security, and data platform teams to enforce compliant data access.
  • Drive multi-quarter infrastructure initiatives, including cluster architecture, inference platform, GPU compute strategy, and observability.
  • Write and lead technical design documents and design reviews.
  • Define infrastructure standards and development-workflow conventions.
  • Contribute to technical governance across AI engineering.
  • Mentor engineers on reliability engineering, infrastructure-as-code, and MLOps best practices.
  • Bridge the gap between prototypes and production-grade systems.

Benefits

  • Competitive salary & equity compensation for full-time roles
  • Unlimited PTO, company holidays, and quarterly mental health days
  • Comprehensive health benefits including medical, dental & vision, and parental leave
  • Employee Stock Purchase Program (ESPP)
  • 401(k) benefits with employer matching contribution
  • Offsite team retreats