AI Infrastructure Engineer

Matter Inc · Menlo Park, CA

About The Position

Matter is building the AI-native autonomy stack for physical manufacturing in the United States. We operate our own factories, deploy our own software, and collect data from every stage of production — from CAD intake to finished goods. Our platform, MatterOS, is the unified software layer for factory operations, process orchestration, and autonomy deployment. The data pipeline that feeds it — from machine telemetry on the floor to model training in the cloud — is the infrastructure you will build and own.

We are hiring an AI Infrastructure Engineer to design and operate the data and compute systems that power MatterOS and our Sim2Real training pipeline. You will work across edge computing, cloud training infrastructure, and the data pipelines that make our “Smart Data” strategy real. Your job is to ensure that every data point — from a torque sensor reading to a camera frame — is tagged with the machine ID, process state, and production context that makes it trainable.
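As a purely illustrative sketch of what "tagged with production context" can mean in practice (the field names below are hypothetical, not Matter's actual MatterOS schema), a telemetry record might carry its machine, process, and job context alongside the raw measurement:

```python
from dataclasses import dataclass, asdict
import time

# Hypothetical example only: fields are illustrative, not Matter's real schema.
@dataclass
class TelemetryRecord:
    machine_id: str     # which workcell or machine produced the reading
    process_state: str  # e.g. "fastening", "idle", "inspection"
    work_order: str     # production context tying the reading to a job
    sensor: str         # sensor channel, e.g. "torque_nm"
    value: float        # raw measurement
    timestamp: float    # Unix epoch seconds

record = TelemetryRecord(
    machine_id="cell-07",
    process_state="fastening",
    work_order="WO-1234",
    sensor="torque_nm",
    value=12.4,
    timestamp=time.time(),
)
print(asdict(record)["machine_id"])  # → cell-07
```

Records shaped this way stay joinable to process and production metadata downstream, which is what makes them usable as training data rather than anonymous sensor noise.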

Requirements

  • 3+ years of experience in ML infrastructure, MLOps, or data engineering in a production environment
  • Strong command of distributed data systems: Kafka, Flink, or equivalent; time-series databases (InfluxDB, TimescaleDB, or similar)
  • Experience with GPU cluster management and distributed training (SLURM, Ray, or Kubernetes-based)
  • Familiarity with industrial protocols: OPC UA, MQTT, Modbus (or willingness to learn quickly)
  • Proficiency in Python; comfort with C++ or Rust for performance-critical edge components is a plus
  • Systems thinking: you understand that data quality, not data volume, is what makes AI work in constrained physical environments

Nice To Haves

  • Experience with NVIDIA Isaac Sim, ROS2, or edge AI deployment (Jetson, FPGA, or similar)
  • Background in industrial IoT or factory automation systems
  • Familiarity with model serving frameworks (Triton, TorchServe, or ONNX Runtime)

Responsibilities

  • Design and maintain the edge-to-cloud data pipeline with semantic context preserved end-to-end
  • Build and manage GPU compute infrastructure for VLA model training, experiment tracking, and distributed training workflows
  • Implement the data collection layer for 100% capture from modular assembly workcells, including camera feeds, sensor streams, machine state, and process metadata
  • Develop feature engineering pipelines that transform raw operational data into structured training inputs for AI models
  • Manage model deployment to edge hardware in the factory: latency, versioning, rollback, and monitoring in production
  • Build observability systems that surface model performance degradation, data drift, and equipment anomalies in real time
  • Collaborate with AI researchers to translate model requirements into infrastructure specifications and vice versa