About The Position

Voxel is looking for a Staff Machine Learning Infrastructure Engineer to drive the next wave of our computer-vision platform for workplace safety. You will be the technical owner of three pillars of our ML lifecycle: ground-truth data and labeling workflows, large-scale training infrastructure, and continuous model lifecycle management. If you excel at designing cloud-native, distributed systems that turn raw video into production-ready, version-controlled models, we'd love to meet you.

Requirements

  • Bachelor's (or higher) in Computer Science, EE, or related field.
  • 5+ years building and operating large-scale infrastructure, with at least 3 years focused on ML or data-intensive systems.
  • Proven record designing highly available, distributed systems on Kubernetes (EKS, GKE, or on-prem).
  • Deep expertise with orchestration (K8s operators, Argo, Kubeflow), and cluster-scale storage / compute (S3, GCS, Ray, Spark, Dask).
  • Hands-on experience automating data-labeling or ground-truth workflows and maintaining dataset versioning.
  • Strong software-engineering fundamentals, including familiarity with best practices for testing, observability, and secure coding.
  • Demonstrated DevOps mindset: IaC (Terraform/CDK), CI/CD (GitHub Actions, ArgoCD), and metrics & alerting (Prometheus/Grafana).

Nice To Haves

  • Experience running multi-instance / multi-GPU training jobs, mixed-precision optimizations, or TensorRT / Triton inference.
  • Familiarity with active-learning, continuous-training, or online distillation pipelines.
  • Background in model registry tooling (MLflow, BentoML, SageMaker Registry) and evaluation dashboards.
  • Prior work with computer-vision models (YOLO, DETR, Faster R-CNN) or video understanding at scale.
  • Contributions to open-source ML infra projects or published talks/blogs on MLOps.
  • Exposure to edge-deployment or real-time inference systems.
  • Experience shipping high-quality production code in Python.

Responsibilities

  • Own data & labeling pipelines: architect scalable labeling services (storage, query, retrieval), design ontologies, automate annotation workflows, and build quality-tiered datasets that stay within cost constraints.
  • Build and operate training infrastructure: create multi-GPU / multi-node training frameworks (Ray, Spark, Kubernetes), optimize distributed jobs, and integrate acceleration techniques (TensorRT, CUDA Graphs, FP8, etc.).
  • Manage the full model lifecycle: stand up model registries, version control, evaluation suites, and continuous-learning loops that push updates from dev → staging → prod with zero-downtime rollbacks.
  • Provide technical leadership, mentorship, and lightweight project management to a small infrastructure and research squad.
  • Establish DevOps-for-ML best practices (IaC, CI/CD, observability, cost monitoring) so researchers can iterate quickly and safely.
  • Partner with ML engineers on architecture decisions, from data schemas to inference optimizations, ensuring infrastructure and research roadmaps stay tightly aligned.

Benefits

  • Comprehensive health, dental, and vision insurance.
  • Highly competitive paid parental leave and support system.
  • Ownership in the business through an Equity Incentive Plan.
  • Generous paid time off and/or flexible work arrangements.
  • Daily in-office meals, vibrant company events, and team-building activities.
  • 401(k) retirement plan, HSA options, and pre-tax Commuter Card.