Machine Learning Infrastructure Engineer

TRM Labs
San Francisco, CA
Remote

About The Position

As a Senior Software Engineer, ML Infrastructure at TRM Labs, you will collaborate with data scientists, engineers, and product managers to design and operate scalable GPU-backed infrastructure that powers TRM’s AI systems. You will work at the intersection of distributed systems, cloud infrastructure, GPU performance engineering, and applied machine learning — building the foundation that enables high-throughput, production-grade ML workloads.

Requirements

  • Bachelor’s degree (or equivalent) in Computer Science or related field.
  • 5+ years of experience building and operating distributed systems or infrastructure in production environments.
  • Experience deploying and operating ML/LLM inference workloads on GPU clusters in cloud environments (AWS and/or GCP).
  • Deep understanding of high-throughput inference systems, including batching strategies, token throughput optimization, and the trade-offs between latency, throughput, and cost.
  • Experience with one or more ML serving frameworks such as Triton Inference Server, vLLM, Ray Serve, ONNX Runtime, or HuggingFace Optimum.
  • Experience optimizing GPU load, memory efficiency, and performance bottlenecks in production systems.
  • Familiarity with distributed inference strategies including model parallelism and tensor parallelism.
  • Experience working with Kubernetes or equivalent orchestration systems in cloud environments.

Nice To Haves

  • Familiarity with heterogeneous accelerators (e.g., Inferentia) is a plus.
  • CUDA familiarity and experience debugging GPU-related issues is a plus.

Responsibilities

  • Design and operate GPU cluster infrastructure.
  • Build and manage GPU-backed environments in cloud settings, including orchestration, autoscaling, resource isolation, and workload management across multiple concurrent models and users.
  • Implement and tune serving systems that maximize token throughput, batching efficiency, GPU occupancy, and cost effectiveness across interactive and batch workloads.
  • Support and operationalize model parallelism, tensor parallelism, and other distributed serving patterns for large-scale models.
  • Integrate and optimize acceleration stacks such as TensorRT, ONNX Runtime, vLLM, FlashAttention, and related tooling to improve performance and reduce inference cost.
  • Design systems that manage multiple models, multiple users, and mixed workload types across heterogeneous accelerators (e.g., NVIDIA GPUs, Inferentia), ensuring predictable performance under varying demand.
  • Instrument systems to measure GPU load, memory utilization, batching efficiency, queue depth, and token throughput, and use data to continuously improve performance and reliability.
  • Work closely with infrastructure, ML, and product teams to ensure models transition smoothly from experimentation to production-grade, highly available services.

Benefits

  • TRM moves quickly. We are a high-velocity, high-ownership team that expects clarity, follow-through, and impact.
  • People who thrive here are energized by hard problems, experimentation, and continuous feedback.
  • If something takes months elsewhere, it will ship here in days.
  • Our work sits at the intersection of AI, national security, and fighting crime.
  • The problems are complex, the stakes are real, and the environment evolves quickly.
  • The pace and intensity of the work reflect the importance of the mission.
  • As a result, the way we operate requires a high level of ownership, adaptability, collaboration, and creative problem-solving.
  • Priorities and targets that change quickly as we experiment and iterate
  • Work that often requires operating with a high degree of ambiguity
  • A high level of personal ownership and accountability
  • Close collaboration across teams and functions
  • Frequent, high-touch communication
  • Creative problem solving and out-of-the-box thinking
  • A pace that rewards urgency, adaptability, and outcomes
  • Meaningful problems, ambitious goals, and mission-driven colleagues
  • TRM is a Series C company with $220M in total funding, backed by Blockchain Capital, Goldman Sachs, Bessemer, Y Combinator, Thoma Bravo, and others.
  • Headquartered in San Francisco, TRM operates as a distributed-first company with hubs in Los Angeles, San Francisco, New York, Washington D.C., London, and Singapore.