ML Engineer - Inference & Model Deployment

HiringCafe • Cupertino, CA

3d • Onsite

About The Position

HiringCafe is building a job search engine that is fast, comprehensive, honest, and useful. The company is looking for a founding ML engineer to help turn AI and ML models into fast, reliable production systems. The role focuses on deploying models, optimizing inference, scaling serving systems, and keeping production performance efficient. It is a hands-on engineering role for people who are passionate about model performance, GPU utilization, inference architecture, and production reliability.

Requirements

Experience deploying and optimizing deep learning models in production environments.
Experience with large-scale model serving, multi-GPU inference, or high-throughput inference systems.
Understanding of inference optimization techniques such as quantization, pruning, compilation, batching, caching, and memory optimization.
Strong instincts for profiling, benchmarking, and debugging model performance.
Familiarity with efficient attention mechanisms, transformer optimization, or modern LLM/embedding/ranking model infrastructure.
Experience with inference frameworks or serving stacks such as SGLang, vLLM, TensorRT, or equivalent.
Ability to write clean, production-quality code and integrate ML systems into backend infrastructure.
Comfort with cloud platforms, distributed systems, storage systems, and modern ML training or serving workflows.
Desire for ownership, leverage, and responsibility from day one.

Responsibilities

Deploy and integrate researcher-trained model checkpoints into cloud infrastructure and production pipelines.
Profile and benchmark model performance to identify latency, throughput, memory, and compute bottlenecks.
Implement optimization techniques such as quantization, pruning, batching, caching, efficient attention, and precision trade-offs while preserving model quality.
Build scalable multi-GPU inference systems for search, ranking, recommendations, agents, and other AI-powered product experiences.
Design reliable model-serving architecture that can support millions of users.
Develop efficient training and fine-tuning workflows where needed, including distributed training, mixed precision, and parallelism strategies.
Work closely with the search and engineering teams to make model deployment a smooth part of the development workflow.