Staff MLOps Engineer – LLMOps

TRM Labs
$220,000 - $240,000

About The Position

The AI Engineering Team is chartered with enabling next-generation AI applications, with a special focus on Large Language Models (LLMs) and agentic systems. Our mission is to build robust pipelines, high-performance infrastructure, and operational tooling that allow AI systems to be deployed with speed, safety, and scale. We manage petabyte-scale pipelines, serve models with millisecond-level latency, and provide the observability and governance needed to make AI production-ready.

We are also deeply involved in evaluating and integrating cutting-edge tools in the LLM and agent space, including open-source stacks, vector databases, evaluation frameworks, and orchestration tools that let TRM innovate faster than the market. As a Staff MLOps Engineer with a focus on LLMOps, you will be at the core of building and scaling the technical infrastructure for our AI/ML systems.

Requirements

  • Write high-quality, maintainable software, primarily in Python; we value engineering ability over language familiarity.
  • Have a strong background in scalable infrastructure, including:
    • Containerization and orchestration (e.g., Docker, Kubernetes)
    • Infrastructure-as-code and deployment (e.g., Terraform, CI/CD pipelines)
    • Monitoring and logging frameworks (e.g., Datadog, Prometheus, OpenTelemetry)
  • Understand and implement MLOps best practices, including:
    • Model versioning and rollback strategies
    • Automated evaluation and drift detection
    • Scalable model and agent serving infrastructure (e.g., vLLM, Triton, BentoML)
  • Deploy and maintain LLM and agentic workflows in production, including:
    • Monitoring cost, latency, and performance
    • Capturing traces for analysis and debugging
    • Optimizing prompt/response flows with real-time data access
  • Demonstrate strong ownership and pragmatism, balancing infrastructure elegance with iterative delivery and measurable impact.

Responsibilities

  • Build reusable CI/CD workflows for model training, evaluation, and deployment, integrating tools such as Langfuse, GitHub Actions, and experiment tracking.
  • Automate model versioning, approval workflows, and compliance checks across environments.
  • Build out a modular, scalable AI infrastructure stack, including vector databases, feature stores, model registries, and observability tooling.
  • Partner with engineering and data science to embed AI models and agents into real-time applications and workflows.
  • Continuously evaluate and integrate state-of-the-art AI tools (e.g., LangChain, LlamaIndex, vLLM, MLflow, BentoML).
  • Drive AI reliability and governance, enabling experimentation while ensuring compliance, security, and uptime.
  • Improve AI/ML model performance.
  • Ensure data accuracy, consistency, and reliability, leading to better model training and inference.
  • Deploy infrastructure to support offline and online evaluation of LLMs and agents, including regression testing, cost monitoring, and human-in-the-loop workflows.
  • Enable researchers to iterate quickly by providing sandboxes, dashboards, and reproducible environments.


What This Job Offers

Job Type: Full-time
Career Level: Mid Level
Education Level: No Education Listed
Number of Employees: 101-250 employees

© 2024 Teal Labs, Inc