Member of Technical Staff, Kernels

Inception
San Francisco, CA

About The Position

Inception creates the world’s fastest, most efficient AI models. Our Mercury model is the world’s fastest reasoning LLM and the first commercially available diffusion LLM, delivering 5x greater speed and efficiency than today’s LLMs with best-in-class quality. We are the AI researchers and engineers behind breakthrough technologies such as diffusion models, FlashAttention, and DPO.

We are looking for engineers and scientists to design, optimize, and maintain the compute foundations that power large-scale language model training. You will develop high-performance ML kernels (e.g., CUDA, CuTe, Triton), enable efficient low-precision arithmetic, and improve the distributed compute stack that makes training large models possible. Your work will make inference faster, more cost-effective, and more reliable.

Requirements

  • BS/MS/PhD in Computer Science, Engineering, or a related field (or equivalent experience)
  • Knowledge of ML serving frameworks (e.g., SGLang, vLLM, PyTorch, Triton, DeepSpeed, XLA)
  • Understanding of ML frameworks (PyTorch, TensorFlow) from a systems perspective
  • Proficiency in CUDA, CuTe, Triton, or other GPU programming frameworks
  • Familiarity with distributed training techniques (data parallel, model parallel, pipeline parallel)
  • Experience implementing low-precision formats (FP8, INT8, block floating point) or contributing to related compiler stacks (e.g., XLA, TVM)
  • Proficiency in Python and at least one systems programming language (C++/Rust/Go)
  • Experience with containerization (Docker), orchestration (Kubernetes), and CI/CD pipelines

Nice To Haves

  • Experience building and maintaining large-scale language models with tens of billions of parameters or more
  • Experience with distributed systems and cloud computing platforms (AWS/GCP/Azure)
  • Experience with ML workflow orchestration tools (Kubeflow, Airflow)
  • Background in performance optimization and profiling of ML systems
  • Knowledge of ML-specific infrastructure challenges (checkpointing, resource scheduling, etc.)
  • Experience with MLOps practices and tooling

Responsibilities

  • Design and implement custom ML kernels (e.g., CUDA, CuTe, Triton) for core LLM operations such as attention, matrix multiplication, gating, and normalization, optimized for modern GPU and accelerator architectures.
  • Design and refine compute primitives that reduce memory-bandwidth bottlenecks and improve kernel compute efficiency.
  • Contribute to infrastructure stability and scalability, ensuring reproducibility, consistency across precision formats, and high utilization of compute resources.

Benefits

  • Competitive salary and equity in a rapidly growing startup
  • Access to the latest GPU hardware and cloud resources
  • Flexible vacation and paid time off (PTO)
  • Health, dental, and vision insurance
  • A collaborative and inclusive culture


What This Job Offers

Job Type: Full-time
Career Level: Mid Level
Education Level: Ph.D. or professional degree
