Staff Software Engineer, ML Infrastructure

Voxel
San Francisco, CA (Hybrid)

About The Position

Voxel is building the future of Computer Vision and Machine Learning for operations, risk, and safety. Our technology addresses key cost drivers for workers’ compensation, general liability, and property damage. We've passed $10M ARR with strong expansion revenue and are backed by industry-leading VCs.

This role focuses on Voxel's perception system, the core of our product. Our models detect human activity, equipment interactions, environmental hazards, and operational state in real time.

As a Staff Software Engineer, you will own ML Infrastructure, setting the technical direction for training, tracking, and shipping vision models. You will build foundational systems for the applied ML team, shape architectural decisions for our ML stack, and write code while owning outcomes end-to-end. You will collaborate with applied CV engineers, the ML Data team, and the Platform team, serving as the technical voice for ML infrastructure tradeoffs.

Requirements

  • 7+ years of experience building and shipping large-scale software systems.
  • At least 3 years of experience focused on ML infrastructure or large-scale data infrastructure.
  • Proven track record of defining system architecture, including tool selection, framework choices, and build-vs-buy decisions for systems used by other engineers.
  • Deep fluency in PyTorch and the modern ML training stack, with a strong understanding of experiment tracking, reliable large-scale training pipelines, and failure modes.
  • Strong Python skills, with experience writing performant and maintainable production code.
  • A pragmatic approach to shipping, with the ability to differentiate between critical architectural decisions and those that can be revisited.
  • Strong communication skills, capable of clearly explaining complex tradeoffs to ML researchers, infrastructure peers, and leadership.

Nice To Haves

  • Production experience on AWS (S3, EC2, EKS, or similar) for ML workloads.
  • Hands-on experience with model export and inference optimization (TensorRT, ONNX, etc.), including measuring accuracy and latency tradeoffs.
  • Experience with modern ML orchestration tools (e.g., Ray, Sematic, Flyte, Metaflow, Prefect).
  • Familiarity with GPU performance profiling and optimization tools (e.g., Nsight, PyTorch profiler).
  • Background in computer vision model training.

Responsibilities

  • Set the technical direction for ML infrastructure at Voxel, including build-vs-buy decisions and system integration as the team and model portfolio scale.
  • Architect and build training infrastructure to enable the applied ML team to run concurrent experiments and iterate quickly on new architectures using PyTorch on AWS.
  • Own the train-to-deploy process, including exporting trained models to optimized inference formats (TensorRT, ONNX), quantifying accuracy and latency impacts, and collaborating with the Platform team on production deployment.
  • Select and implement an experiment tracking and lifecycle management stack (e.g., Weights & Biases, MLflow, ClearML) for efficient experiment execution, comparison, and reproduction.
  • Establish DevOps-for-ML best practices, including Infrastructure as Code (IaC), CI/CD, observability, and cost monitoring, to facilitate rapid and safe iteration by researchers.
  • Mentor engineers across Vision & AI on ML infrastructure best practices, enhancing the organization's approach to training, evaluation, and deployment.
  • Anticipate future infrastructure needs, including the upcoming transition to on-device inference, and architect solutions for the next 12-18 months.

Benefits

  • Equity through Voxel’s Equity Incentive Plan
  • Total compensation includes base salary, annual bonus, and equity
  • Comprehensive health, dental, and vision insurance
  • Competitive paid parental leave
  • Unlimited PTO and flexible work arrangements
  • Daily meals in-office
  • Team events
  • Annual company onsite