About The Position

SPREEAI is building the future of AI-powered commerce through photorealistic virtual try-on and multimodal intelligence. We bring together cutting-edge AI and real-world retail to deliver production systems that redefine how people shop online.

We are looking for a Principal Engineer to build the infrastructure, deployment pipelines, and observability systems that enable multimodal AI models to move from research prototypes to reliable, production-grade deployments powering real-time virtual try-on experiences for global retail partners. This role spans ML platform engineering, deployment systems, GPU infrastructure, and observability. You will partner closely with Applied Science, AI Platform, Product, and Partner Engineering to enable rapid research iteration and reliable model delivery at scale.

Requirements

  • 10+ years of software engineering or infrastructure experience, with 5+ years in ML infrastructure, MLOps, distributed systems, or AI platform engineering.
  • Deep experience with Python, PyTorch, Kubernetes, Docker, cloud infrastructure, and GPU-based workloads.
  • Strong understanding of distributed systems and large-scale ML infrastructure design.
  • Experience with ML workflow orchestration systems such as Ray, Kubeflow, Argo, Airflow, Flyte, or Metaflow.
  • Experience deploying and managing production inference systems using platforms like Triton, vLLM, TensorRT-LLM, Ray Serve, KServe, Seldon, BentoML, TorchServe, or custom services.
  • Strong understanding of inference optimization techniques such as batching, quantization, CUDA graphs, and memory-aware scheduling.
  • Experience with model registries, experiment tracking, CI/CD for ML, canary deployments, shadow traffic, rollback strategies, and production monitoring.
  • Strong cloud experience across AWS, GCP, Azure, or GPU-focused providers like CoreWeave, Lambda Labs, or RunPod.
  • Ability to debug performance bottlenecks across distributed systems, containers, networking, GPU memory, and storage layers.
  • Strong ownership mindset with the ability to define architecture, set platform standards, and drive execution across teams.

Nice To Haves

  • Experience with multimodal, vision, or generative AI systems.
  • Experience with large-scale GPU clusters (e.g., A100/H100), NCCL, and high-throughput data pipelines.
  • Experience designing evaluation and monitoring systems for generative AI workloads.
  • Familiarity with ML security, privacy, and data governance practices.
  • Experience building internal developer platforms for research teams.

Responsibilities

  • Build and operate SPREEAI’s end-to-end ML platform spanning training, evaluation, deployment, and monitoring.
  • Enable scalable and reliable training workflows through orchestration, infrastructure, and resource management systems.
  • Define platform standards for model packaging, model registry, dataset lineage, experiment tracking, checkpointing, and deployment automation.
  • Enable reliable and scalable inference deployments through standardized serving, orchestration, and monitoring frameworks.
  • Build and operate model deployment pipelines with versioning, reproducibility, rollback, approval gates, evaluation gates, and production observability.
  • Establish production SLOs for latency, availability, error rate, GPU saturation, cold-start time, cost per inference, and model quality drift.
  • Standardize and support serving infrastructure using modern inference runtimes such as vLLM, NVIDIA Triton, TensorRT-LLM, Ray Serve, TorchServe, ONNX Runtime, or equivalent systems.
  • Design and manage GPU allocation, scheduling, and resource utilization across training and inference workloads.
  • Improve GPU utilization, throughput, latency, reliability, and cost efficiency across model lifecycle systems.
  • Design and operate model evaluation and benchmarking systems, including automated regression detection and quality gates for production releases.
  • Partner with research teams to productionize new capabilities by providing robust infrastructure, tooling, and deployment pathways.

Benefits

  • Build the Core AI Infrastructure, Not Just Features: You will define how multimodal AI systems are reliably deployed, monitored, and scaled—directly shaping the performance, cost efficiency, and reliability of real-world AI products.
  • Own Systems End-to-End: You will own critical infrastructure decisions across deployment, observability, and resource management, with direct impact on production systems serving real partner traffic.
  • Work on Hard, High-Leverage Problems: From GPU efficiency to large-scale deployment systems, you will tackle challenges that sit at the frontier of real-time AI infrastructure.
  • High Velocity, Low Bureaucracy & Direct Impact: We operate with tight feedback loops between research, platform, and product, enabling rapid iteration and meaningful impact without organizational friction.