Help us make inference blazingly fast. If you love squeezing every last drop of performance out of GPUs, diving deep into CUDA kernels, and turning optimization techniques into production systems, we'd love to meet you.

About Inference.net
Inference.net trains and hosts specialized language models for companies that need frontier-quality AI at a fraction of the cost. The models we train match GPT-5 accuracy but are smaller, faster, and up to 90% cheaper. Our platform handles everything end-to-end: distillation, training, evaluation, and planet-scale hosting.

We are a well-funded ten-person team of engineers working in person in downtown San Francisco on difficult, high-impact engineering problems. Everyone on the team has been writing code for over 10 years and has founded and run their own software companies. We are high-agency, adaptable, and collaborative. We value creativity alongside technical prowess and humility. We work hard and deeply enjoy what we do. Most of us are in the office four days a week in SF; hybrid works for Bay Area candidates.

About the Role
You will be responsible for making our inference stack as fast and efficient as possible. Your work ranges from implementing known optimization techniques to experimenting with novel approaches, always with the goal of serving models faster and cheaper at scale.

Your north star is inference performance: latency, throughput, cost efficiency, and how quickly we can bring new model architectures into production. You'll work across the full inference stack, from CUDA kernels to serving frameworks, to find and eliminate bottlenecks.

This role reports directly to the founding team. You'll have autonomy, a large compute budget, and technical support to push the limits of what's possible in model serving.
Job Type
Full-time
Career Level
Mid Level