Machine Learning Infrastructure Engineer

TRM Labs
San Francisco, CA
Remote

About The Position

As a Senior Software Engineer, ML Infrastructure at TRM Labs, you will collaborate with data scientists, engineers, and product managers to design and operate scalable GPU-backed infrastructure that powers TRM’s AI systems. You will work at the intersection of distributed systems, cloud infrastructure, GPU performance engineering, and applied machine learning — building the foundation that enables high-throughput, production-grade ML workloads.

Requirements

  • Bachelor’s degree (or equivalent) in Computer Science or related field.
  • 5+ years of experience building and operating distributed systems or infrastructure in production environments.
  • Experience deploying and operating ML/LLM inference workloads on GPU clusters in cloud environments (AWS and/or GCP).
  • Deep understanding of high-throughput inference systems, including batching strategies, token throughput optimization, and the trade-offs between latency, throughput, and cost.
  • Experience with one or more ML serving frameworks such as Triton Inference Server, vLLM, Ray Serve, ONNX Runtime, or HuggingFace Optimum.
  • Experience optimizing GPU load, memory efficiency, and performance bottlenecks in production systems.
  • Familiarity with distributed inference strategies including model parallelism and tensor parallelism.
  • Experience working with Kubernetes or equivalent orchestration systems in cloud environments.

Nice To Haves

  • Familiarity with heterogeneous accelerators (e.g., Inferentia) is a plus.
  • CUDA familiarity and experience debugging GPU-related issues is a plus.

Responsibilities

  • Design and operate GPU cluster infrastructure.
  • Build and manage GPU-backed environments in cloud settings, including orchestration, autoscaling, resource isolation, and workload management across multiple concurrent models and users.
  • Implement and tune serving systems that maximize token throughput, batching efficiency, GPU occupancy, and cost effectiveness across interactive and batch workloads.
  • Support and operationalize model parallelism, tensor parallelism, and other distributed serving patterns for large-scale models.
  • Integrate and optimize acceleration stacks such as TensorRT, ONNX Runtime, vLLM, FlashAttention, and related tooling to improve performance and reduce inference cost.
  • Design systems that manage multiple models, multiple users, and mixed workload types across heterogeneous accelerators (e.g., NVIDIA GPUs, Inferentia), ensuring predictable performance under varying demand.
  • Instrument systems to measure GPU load, memory utilization, batching efficiency, queue depth, and token throughput, and use data to continuously improve performance and reliability.
  • Work closely with infrastructure, ML, and product teams to ensure models transition smoothly from experimentation to production-grade, highly available services.

Benefits

  • TRM moves quickly. We are a high-velocity, high-ownership team that expects clarity, follow-through, and impact.
  • People who thrive here are energized by hard problems, experimentation, and continuous feedback.
  • If something takes months elsewhere, it will ship here in days.
  • Our work sits at the intersection of AI, national security, and fighting crime.
  • The problems are complex, the stakes are real, and the environment evolves quickly.
  • The pace and intensity of the work reflect the importance of the mission.
  • As a result, the way we operate requires a high level of ownership, adaptability, collaboration, and creative problem-solving.
  • Priorities and targets that change quickly as we experiment and iterate
  • Work that often requires operating with a high degree of ambiguity
  • A high level of personal ownership and accountability
  • Close collaboration across teams and functions
  • Frequent, high-touch communication
  • Creative problem solving and out-of-the-box thinking
  • A pace that rewards urgency, adaptability, and outcomes
  • Meaningful problems, ambitious goals, and mission-driven colleagues
  • TRM is a Series C company with $220M in total funding, backed by Blockchain Capital, Goldman Sachs, Bessemer, Y Combinator, Thoma Bravo, and others.
  • Headquartered in San Francisco, TRM operates as a distributed-first company with hubs in Los Angeles, San Francisco, New York, Washington D.C., London, and Singapore.