Member of Technical Staff, Kernels

Inception
San Francisco, CA

About The Position

Inception creates the world’s fastest, most efficient AI models. Our Mercury model is the world’s fastest reasoning LLM and the first commercially available diffusion LLM, delivering 5x greater speed and efficiency than today’s LLMs with best-in-class quality. We are the AI researchers and engineers behind breakthrough technologies such as diffusion models, FlashAttention, and DPO.

We are looking for engineers and scientists to design, optimize, and maintain the compute foundations that power large-scale language model training. You will develop high-performance ML kernels (e.g., CUDA, CuTe, Triton), enable efficient low-precision arithmetic, and improve the distributed compute stack that makes training large models possible. Your work will make inference faster, more cost-effective, and more reliable.

Requirements

  • BS/MS/PhD in Computer Science, Engineering, or a related field (or equivalent experience)
  • Knowledge of ML serving frameworks (e.g., SGLang, vLLM, PyTorch, Triton, DeepSpeed, XLA)
  • Understanding of ML frameworks (PyTorch, TensorFlow) from a systems perspective
  • Proficiency in CUDA, CuTe, Triton, or other GPU programming frameworks
  • Familiarity with distributed training techniques (data parallel, model parallel, pipeline parallel)
  • Experience implementing low-precision formats (FP8, INT8, block floating point) or contributing to related compiler stacks (e.g., XLA, TVM)
  • Proficiency in Python and at least one systems programming language (C++/Rust/Go)
  • Experience with containerization (Docker), orchestration (Kubernetes), and CI/CD pipelines

Nice To Haves

  • Experience building and maintaining large-scale language models with tens of billions of parameters or more
  • Experience with distributed systems and cloud computing platforms (AWS/GCP/Azure)
  • Experience with ML workflow orchestration tools (Kubeflow, Airflow)
  • Background in performance optimization and profiling of ML systems
  • Knowledge of ML-specific infrastructure challenges (checkpointing, resource scheduling, etc.)
  • Experience with MLOps practices and tooling

Responsibilities

  • Design and implement custom ML kernels (e.g., CUDA, CuTe, Triton) for core LLM operations such as attention, matrix multiplication, gating, and normalization, optimized for modern GPU and accelerator architectures.
  • Design and refine compute primitives that reduce memory-bandwidth bottlenecks and improve kernel compute efficiency.
  • Contribute to infrastructure stability and scalability, ensuring reproducibility, consistency across precision formats, and high utilization of compute resources.

Benefits

  • Competitive salary and equity in a rapidly growing startup
  • Access to the latest GPU hardware and cloud resources
  • Flexible vacation and paid time off (PTO)
  • Health, dental, and vision insurance
  • A collaborative and inclusive culture


What This Job Offers

Job Type: Full-time
Career Level: Mid Level
Education Level: Ph.D. or professional degree
