About The Position

Voxel is looking for a Staff Machine Learning Infrastructure Engineer to drive the next wave of our computer-vision platform for workplace safety. You will be the technical owner of three pillars of our ML lifecycle: ground-truth data and labeling workflows, large-scale training infrastructure, and continuous model lifecycle management. If you excel at designing cloud-native, distributed systems that turn raw video into production-ready, version-controlled models, we'd love to meet you.

Requirements

  • Bachelor's (or higher) in Computer Science, EE, or related field.
  • 5+ years building and operating large-scale infrastructure, with at least 3 years focused on ML or data-intensive systems.
  • Proven record designing highly available, distributed systems on Kubernetes (EKS, GKE, or on-prem).
  • Deep expertise with orchestration (K8s operators, Argo, Kubeflow), and cluster-scale storage / compute (S3, GCS, Ray, Spark, Dask).
  • Hands-on experience automating data-labeling or ground-truth workflows and maintaining dataset versioning.
  • Strong software-engineering fundamentals, including familiarity with best practices for testing, observability, and secure coding.
  • Demonstrated DevOps mindset: IaC (Terraform/CDK), CI/CD (GitHub Actions, ArgoCD), and metrics & alerting (Prometheus/Grafana).

Nice To Haves

  • Experience running multi-instance / multi-GPU training jobs, mixed-precision optimizations, or TensorRT / Triton inference.
  • Familiarity with active-learning, continuous-training, or online distillation pipelines.
  • Background in model registry tooling (MLflow, BentoML, SageMaker Registry) and evaluation dashboards.
  • Prior work with computer-vision models (YOLO, DETR, Faster R-CNN) or video understanding at scale.
  • Contributions to open-source ML infra projects or published talks/blogs on MLOps.
  • Exposure to edge-deployment or real-time inference systems.
  • Experience shipping high-quality production code in Python.

Responsibilities

  • Own data & labeling pipelines: architect scalable labeling services (storage, query, retrieval), design ontologies, automate annotation workflows, and build quality-tiered datasets that stay within cost constraints.
  • Build and operate training infrastructure: create multi-GPU / multi-node training frameworks (Ray, Spark, Kubernetes), optimize distributed jobs, and integrate acceleration techniques (TensorRT, CUDA Graphs, FP8, etc.).
  • Manage the full model lifecycle: stand up model registries, version control, evaluation suites, and continuous-learning loops that push updates from dev → staging → prod with zero-downtime rollbacks.
  • Provide technical leadership, mentorship, and lightweight project management to a small infrastructure and research squad.
  • Establish DevOps-for-ML best practices (IaC, CI/CD, observability, cost monitoring) so researchers can iterate quickly and safely.
  • Partner with ML engineers on architecture decisions, from data schemas to inference optimizations, ensuring infrastructure and research roadmaps stay tightly aligned.

Benefits

  • Comprehensive health, dental, and vision insurance.
  • Highly competitive paid parental leave and support system.
  • Ownership in the business through an Equity Incentive Plan.
  • Generous paid time off and/or flexible work arrangements.
  • Daily in-office meals, vibrant company events, and team-building activities.
  • 401(k) retirement plan, HSA options, and pre-tax Commuter Card.