ML Engineer - Inference & Model Deployment

HiringCafe • Cupertino, CA

4d • $250,000 - $310,000 • Onsite

About The Position

HiringCafe is building a job search engine that aims to be 100x better than existing platforms like Indeed and LinkedIn. The company is looking for a founding ML engineer to turn AI and ML models into efficient, reliable production systems. The role focuses on deploying models, optimizing inference performance (latency and throughput), scaling serving systems, and keeping models running efficiently in production. It is a hands-on engineering position for people passionate about model performance, GPU utilization, inference architecture, and production reliability.

Requirements

Experience deploying and optimizing deep learning models in production environments.
Experience with large-scale model serving, multi-GPU inference, or high-throughput inference systems.
Understanding of inference optimization techniques such as quantization, pruning, compilation, batching, caching, and memory optimization.
Strong skills in profiling, benchmarking, and debugging model performance.
Familiarity with efficient attention mechanisms, transformer optimization, or modern LLM/embedding/ranking model infrastructure.
Experience with inference frameworks or serving stacks like SGLang, vLLM, TensorRT, or equivalents.
Ability to write clean, production-quality code and integrate ML systems into backend infrastructure.
Comfort with cloud platforms, distributed systems, storage systems, and modern ML training or serving workflows.
Desire for ownership, leverage, and responsibility from day one.

Responsibilities

Deploy and integrate researcher-trained model checkpoints into cloud infrastructure and production pipelines.
Profile and benchmark model performance to identify bottlenecks in latency, throughput, memory, and compute.
Implement optimization techniques like quantization, pruning, batching, caching, efficient attention, and precision trade-offs while maintaining model quality.
Build scalable multi-GPU inference systems for various AI-powered product experiences such as search, ranking, recommendations, and agents.
Design reliable model-serving architecture capable of supporting millions of users.
Develop efficient training and fine-tuning workflows, including distributed training, mixed precision, and parallelism strategies.
Collaborate with search and engineering teams to integrate model deployment smoothly into the development workflow.