Principal MLOps Engineer

Raft · San Antonio, TX
$150,000 - $200,000 · Remote

About The Position

Raft is building mission-critical AI and data platforms for the Department of Defense (DoD). Our systems ingest and process massive volumes of real-time data from hundreds of sensors and operational sources, transform that data into usable intelligence, and deliver it to operators through mission applications and common operational pictures that support time-sensitive decision-making. Our platform operates at scale, processing billions of events per day with low-latency data pipelines and cloud-native infrastructure.

As Raft expands its AI capabilities, we are investing in a more mature end-to-end machine learning platform to support model development, evaluation, deployment, monitoring, and lifecycle management across both cloud and constrained operational environments.

In this role, you will help design, deploy, and mature Raft’s ML platform and MLOps infrastructure. You will work across Kubernetes-based deployment environments, GPU-enabled infrastructure, model serving systems, CI/CD pipelines, and secure production operations to enable rapid and reliable delivery of machine learning capabilities. This role is ideal for someone who understands both the infrastructure needed to run ML systems in production and the practical needs of the ML engineers building and deploying models.

Requirements

  • 7+ years of relevant hands-on experience in software engineering, platform engineering, DevOps, MLOps, or related technical roles
  • 5+ years of experience with Docker and Kubernetes in production environments
  • 5+ years of experience supporting enterprise cloud infrastructure or applications in AWS, Azure, or similar environments
  • Strong experience provisioning, operating, and troubleshooting Kubernetes clusters in production
  • Experience building and maintaining machine learning platforms, infrastructure, or pipelines used by engineering or data science teams
  • Practical experience deploying machine learning workloads on Kubernetes
  • Experience managing clusters or workloads that use GPUs
  • Strong understanding of Helm and Kubernetes deployment patterns
  • Strong scripting or programming skills, preferably in Python
  • Experience with modern software engineering practices including Git, CI/CD, DevOps, and Agile/Scrum workflows
  • Strong troubleshooting, systems thinking, and communication skills
  • Ability to work independently and collaboratively in a fast-moving environment
  • Ability to obtain and maintain a Top Secret clearance
  • Ability to obtain Security+ certification within the first 90 days of employment

Nice To Haves

  • Experience with ML model serving and inference platforms such as Triton Inference Server, KServe, Ray Serve, vLLM, or similar technologies
  • Experience with secure and compliant deployment practices in regulated or government environments
  • Experience with Kubernetes-based ML platforms such as Kubeflow
  • Familiarity with service mesh technologies such as Istio
  • Experience provisioning and debugging complex CI/CD systems
  • Experience with infrastructure as code tools such as Terraform
  • Familiarity with software supply chain security, container hardening, vulnerability management, and runtime scanning
  • Experience supporting ML systems across multiple deployment environments, including cloud, on-prem, and edge
  • Background working with machine learning engineers on model training, evaluation, packaging, and release workflows
  • Familiarity with storage and artifact systems used in ML platforms, such as S3-compatible object stores, registries, and metadata/catalog systems

Responsibilities

  • Design, build, and maintain secure, scalable MLOps infrastructure and deployment pipelines for production ML systems
  • Help mature Raft’s internal ML platform and model lifecycle capabilities, including model packaging, registry/catalog workflows, deployment, monitoring, and operational support
  • Deploy and manage machine learning workloads on Kubernetes, including GPU-enabled clusters
  • Support model serving and inference infrastructure for a range of ML use cases, including traditional ML, computer vision, speech/audio, and LLM-based systems
  • Build and maintain CI/CD workflows for ML services, model artifacts, and platform components
  • Partner closely with ML engineers, software engineers, and product teams to move models from experimentation to reliable operational deployment
  • Improve observability, reliability, security, and maintainability across ML infrastructure and services
  • Help evaluate and standardize runtime patterns, serving frameworks, and deployment architectures for production ML workloads
  • Contribute to infrastructure decisions across edge, on-prem, and cloud-hosted deployment environments
  • Support compliance-driven deployment practices and secure software supply chain requirements in defense environments
  • Get hands-on with customers at the most forward-leaning places in the Department of Defense

Benefits

  • Highly competitive salary
  • Fully covered health, dental, and vision insurance
  • 401(k) and company match
  • Take-as-you-need PTO + 11 paid holidays
  • Education & training benefits
  • Annual budget for your tech and gadget needs
  • Monthly box of yummy snacks to eat while doing meaningful work
  • Remote, hybrid, and flexible work options
  • Team off-site in fun places!
  • Generous Referral Bonuses
  • And More!


What This Job Offers

  • Job Type: Full-time
  • Career Level: Principal
  • Education Level: No education requirement listed
  • Number of Employees: 101-250
