Staff Machine Learning Systems Engineer (MLOps)

Hims & Hers

$210,000 - $250,000 • Remote

About The Position

Hims & Hers is seeking a Staff ML Systems Engineer to design, build, and operate the production infrastructure for AI across the company. This is a hands-on role focused on the systems that underpin AI, including Kubernetes, CI/CD, infrastructure-as-code, inference/model-serving, and observability. The engineer will own critical systems such as EKS clusters, deployment infrastructure, IAM, secrets management, LLM tracing/observability pipelines, and the developer platform. The role involves partnering with ML engineers, product engineers, and clinical teams to ensure AI systems are reliable, observable, secure, and trustworthy in a regulated healthcare environment. This position is ideal for individuals with a systems and infrastructure mindset, a focus on reliability, security, and cost, and a desire to shape production AI environments that directly affect patient outcomes.

Requirements

8+ years of professional experience in infrastructure, platform, DevOps, or SRE engineering
At least 3 years focused on ML/AI systems in production
Deep, hands-on experience with Kubernetes (ideally EKS) and the cloud-native ecosystem (autoscaling, GitOps, Helm/Kustomize, operating clusters at scale, process/job orchestration)
Strong infrastructure-as-code skills (Terraform) and experience designing secure cloud architectures (IAM, OIDC, secrets management, least-privilege access)
Strong proficiency in Python, with experience building production infrastructure tooling, CLIs, and data/observability pipelines
2+ years of experience operating LLM-based systems in production (LLMOps), including inference routing, serving, tracing, and reliability patterns
Hands-on experience with observability/tracing stacks (Datadog, OpenTelemetry, Langfuse, or equivalent) and metrics/log/trace pipelines
Experience designing and maintaining CI/CD pipelines, build systems, and developer tooling
A systems-and-operations mindset (failure modes, SLOs, observability, security, maintainability)
Experience writing and leading technical design documents (TDDs/RFCs) for infrastructure-scale initiatives
Strong collaboration skills across engineering, ML, product, security, and clinical teams
A deep appreciation for safety, privacy, and security, ideally with experience in a regulated domain such as healthcare, fintech, or life sciences

Nice To Haves

Experience with AWS (EKS, Bedrock, S3, CloudFront, IAM) and multi-cloud (GCP/Vertex AI) inference routing
Experience with Databricks (MLflow, Unity Catalog, Spark, Delta) and data platform access governance
Experience provisioning LLM observability infrastructure (Langfuse, ClickHouse, OpenTelemetry/OTLP tracing, LogFire) and LLM behavior monitoring
Experience with Karpenter, cluster autoscaling, and cost optimization for ML compute
Experience with monorepo build systems (Pants, Bazel) and large-scale CI/CD
Experience building automated PR-review / convention-enforcement pipelines and developer-workflow standards
Familiarity with Vertex AI Agent Builder, Vertex AI Model Registry, or GCP managed AI/ML services
Contributions to open-source infrastructure, IaC modules, SDKs, or developer tooling projects

Responsibilities

Own and scale the AI compute and deployment platform, including Kubernetes cluster operations, autoscaling, storage, and workload isolation.
Build and maintain GitOps-based deployment pipelines that ship AI services safely and repeatably.
Design ephemeral/preview environments, feature-branched deployments, and nightly release pipelines for validation.
Drive efficiency and cost management across compute, autoscaling, and inference infrastructure.
Build and scale inference infrastructure and a multi-provider LLM gateway, managing credentials, rate limits, and failover.
Build reliable serving patterns for LLM-powered workflows.
Create reusable infrastructure abstractions and contracts for AI service deployment and consumption.
Own the LLM/AI observability and tracing stack, including provisioning and scaling systems like Langfuse, Datadog, and OpenTelemetry.
Build analytics and monitoring pipelines for latency, error, quality, and regression signals.
Define SLOs, alerting, on-call runbooks, and incident response for AI infrastructure, leading troubleshooting and improving platform reliability.
Own and improve the monorepo build system and CI/CD pipelines for AI workloads.
Own shared infrastructure tooling, CLIs, and IaC modules.
Identify and eliminate platform bottlenecks to improve developer velocity.
Build IAM, OIDC, and secrets management as first-class infrastructure.
Encode security-by-default, scope boundaries, and access controls into the platform for HIPAA compliance.
Partner with clinical, legal, security, and data platform teams to enforce compliant data access.
Drive multi-quarter infrastructure initiatives, including cluster architecture, inference platform, GPU compute strategy, and observability.
Write and lead technical design documents and design reviews.
Define infrastructure standards and development-workflow conventions.
Contribute to technical governance across AI engineering.
Mentor engineers on reliability engineering, infrastructure-as-code, and MLOps best practices.
Bridge the gap between prototypes and production-grade systems.

Benefits

Competitive salary & equity compensation for full-time roles
Unlimited PTO, company holidays, and quarterly mental health days
Comprehensive health benefits including medical, dental & vision, and parental leave
Employee Stock Purchase Program (ESPP)
401(k) benefits with employer matching contribution
Offsite team retreats
