Software Engineer, ML Infrastructure

Realm Labs · Sunnyvale, CA

About The Position

We are hiring a Founding ML Infrastructure Engineer to own the end-to-end deployment, optimization, and operation of our suite of models in production. This is a core founding role focused on building and operating production-grade LLM systems. You will apply deep knowledge of model internals to deploy, optimize, and run modern LLMs at scale, owning performance end-to-end across latency, throughput, and reliability. You will design and operate the full ML serving stack, from model artifacts to GPU execution, and work closely with Product and ML teams to ensure our models can support high QPS, strict SLAs, and production correctness. This role is ideal for someone who deeply understands how LLMs work internally and chooses to specialize in making them fast, stable, and production-ready.

About Realm Labs

Realm Labs is an AI trust and security startup. We help enterprises detect, debug, and prevent AI misbehavior in production. We are backed by top VCs and serve some of the most iconic global enterprises.

Requirements

  • 5+ years of professional experience in ML infrastructure, systems engineering, or production ML roles.
  • Strong software engineering fundamentals; ability to write robust, maintainable production code.
  • Deep hands-on experience with LLM inference infrastructure, including:
      ◦ PyTorch (required)
      ◦ TensorFlow (working knowledge)
  • Proven experience with GPU inference optimization and serving runtimes (a brief sketch follows this list), including:
      ◦ TensorRT / TensorRT-LLM
      ◦ vLLM
      ◦ Triton Inference Server
      ◦ SGLang or similar serving runtimes
  • Strong understanding of LLM internals, such as:
      ◦ Transformer architectures
      ◦ Attention and KV caching
      ◦ Batching, streaming, and token-level generation
  • Experience running ML systems in production under high traffic and strict SLAs.
  • Comfortable working in Linux-based, cloud production environments.
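
For a concrete flavor of the serving runtimes listed above, here is a minimal sketch using vLLM's offline generation API. The model name and sampling parameters are illustrative placeholders, not a prescribed stack:

    # Minimal vLLM offline-inference sketch (model name is a placeholder).
    from vllm import LLM, SamplingParams

    # vLLM handles continuous batching and paged KV-cache management internally.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16")

    sampling = SamplingParams(temperature=0.7, max_tokens=128)
    outputs = llm.generate(["Explain KV caching in one sentence."], sampling)

    for out in outputs:
        print(out.outputs[0].text)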

Nice To Haves

  • Experience deploying LLMs on Kubernetes and GPU clusters.
  • Familiarity with CUDA, NCCL, or low-level GPU performance concepts.
  • Experience with:
      ◦ Model sharding and parallelism strategies
      ◦ Multi-GPU inference
      ◦ Streaming inference systems
  • Knowledge of observability for ML systems (metrics, latency breakdowns, GPU monitoring); see the sketch after this list.
  • Experience working at startups or owning systems with minimal abstraction layers.
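
As an illustration of the GPU-monitoring side of observability mentioned above, a minimal sketch using NVIDIA's NVML bindings (the nvidia-ml-py package); the polling loop and printed metrics are illustrative:

    # Minimal GPU-utilization polling sketch using NVML (pip install nvidia-ml-py).
    import time
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

    for _ in range(5):
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # SM / memory-bus utilization, %
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # used/total device memory, bytes
        print(f"gpu={util.gpu}% "
              f"mem={mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB")
        time.sleep(1)

    pynvml.nvmlShutdown()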

Responsibilities

  • Own the end-to-end LLM inference stack, including:
      ◦ Model loading and execution
      ◦ GPU utilization and memory efficiency
      ◦ Runtime performance tuning
      ◦ Production deployment and scaling
  • Design and operate high-performance LLM serving systems using technologies such as vLLM, TensorRT / TensorRT-LLM, Triton Inference Server, and SGLang.
  • Optimize inference across:
      ◦ Latency
      ◦ Throughput (QPS)
      ◦ GPU memory footprint
      ◦ Cost efficiency
  • Work hands-on with PyTorch and TensorFlow models, including:
      ◦ Model graph understanding
      ◦ Attention mechanisms, KV cache behavior, and batching strategies
      ◦ Precision tradeoffs (FP16, BF16, INT8, etc.)
  • Build and maintain production-grade GPU services:
      ◦ Multi-model serving
      ◦ Autoscaling strategies
      ◦ Fault isolation and graceful degradation
  • Collaborate with application and platform teams to:
      ◦ Define serving APIs
      ◦ Ensure correctness and safety of outputs
      ◦ Debug production issues end-to-end
  • Build a reproducible model training and versioning system for customer deployments.
  • Establish best practices for:
      ◦ Model versioning
      ◦ Rollouts and rollbacks
      ◦ Performance benchmarking (see the sketch after this list)
      ◦ Production validation
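
To make "performance benchmarking" concrete, here is a minimal latency/throughput harness; infer() is a hypothetical stand-in for whatever serving endpoint is under test, and the request count is arbitrary:

    # Minimal latency/QPS benchmark sketch; infer() is a hypothetical stand-in.
    import time

    def infer(prompt: str) -> str:
        time.sleep(0.05)  # placeholder for a real model call
        return "ok"

    latencies = []
    start = time.perf_counter()
    for i in range(100):
        t0 = time.perf_counter()
        infer(f"request {i}")
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    latencies.sort()
    p50 = latencies[len(latencies) // 2]
    p99 = latencies[int(len(latencies) * 0.99)]
    print(f"qps={len(latencies) / elapsed:.1f} "
          f"p50={p50 * 1e3:.1f} ms p99={p99 * 1e3:.1f} ms")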

Benefits

  • Market-aligned compensation and benefits
  • Founding engineer equity (equity is a significant component of this role and will be discussed)
  • Medical, dental, vision, and life insurance; 401(k); in-office lunch; and more