ML Engineer - Inference

photalabs.com • San Jose, CA • Onsite

About The Position

At Phota Labs, we’re building visual GenAI that helps people capture, express, and relive their memories in ways that feel effortless, personal, and emotionally resonant. Our core technology enables personalized image generation that faithfully reflects who you are and the moments you experienced. Our first goal is to bring visual GenAI into everyday photography.

We're a small team of researchers, engineers, and designers who have always been at the forefront of how people capture, edit, and share images and videos. We build with our hands and hearts. We believe GenAI is the next shift for photography, and are seeking builders who share this vision: people like us, like you. We're just getting started!

The role: As our first ML Engineer specializing in inference and optimization, you'll bridge the gap between cutting-edge research models and production systems. Your expertise will transform PyTorch research code into highly optimized, low-latency inference solutions that power our user-facing applications. You'll work closely with our GenAI researchers, vision ML engineers, and backend team to deliver exceptional performance.

Requirements

Experience deploying and optimizing deep learning models for production environments, particularly with multi-GPU inference and large-scale model serving.
Well-versed in cutting-edge techniques for optimizing both inference and training workloads.
Strong knowledge of efficient attention mechanisms and algorithms.
Hands-on experience implementing model quantization and working with inference frameworks.
Ability to write production-quality code and integrate ML models into robust inference pipelines.
Familiar with various cloud platforms, storage solutions, and modern training frameworks.

Responsibilities

Deploy and integrate researcher-trained model checkpoints into our cloud infrastructure and production pipelines.
Conduct thorough performance profiling and benchmarking to identify and eliminate computational bottlenecks.
Implement neural network optimization techniques including quantization, pruning, and architectural refinements while preserving model accuracy.
Develop efficient training and fine-tuning strategies with optimal precision trade-offs and parallelism.
Build and maintain scalable multi-GPU inference solutions with sophisticated model parallelism and serving architectures.
Collaborate with the research team to ensure optimizations integrate smoothly with model development workflows.