Senior AI Infrastructure Engineer

WEX | Boston, MA (Remote)

About The Position

We are the backbone of the AI organization, building the high-performance compute foundation that powers our generative AI and machine learning initiatives. Our team bridges the gap between hardware and software, ensuring that our researchers and data scientists have a reliable, scalable, and efficient platform to train and deploy models. We focus on maximizing GPU utilization, minimizing inference latency, and creating a seamless "paved road" for AI development.

You are a systems thinker who loves solving hard infrastructure challenges. You will architect the underlying platform that serves our production AI workloads, ensuring they are resilient, secure, and cost-effective. By optimizing our compute layer and deployment pipelines, you will directly accelerate the velocity of the entire AI product team, transforming how we deliver AI at scale.

Requirements

  • 5+ years of experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering, with at least 2 years focused on Machine Learning infrastructure.
  • Production Expertise: Proven track record of managing large-scale production clusters (Kubernetes) and distributed systems.
  • Hardware Fluency: Deep understanding of GPU architectures (NVIDIA A100/H100), CUDA drivers, and networking requirements for distributed workloads.
  • Serving Proficiency: Experience deploying and scaling open-source LLMs and embedding models using containerized solutions.
  • Automation First: Strong belief in "Everything as Code"—you automate toil wherever possible using Python, Go, or Bash.
  • Core Engineering: Expert proficiency in Python and Go; comfortable digging into lower-level system performance.
  • Orchestration & Containers: Mastery of Kubernetes (EKS/GKE), Helm, Docker, and container runtimes.
  • Infrastructure as Code: Advanced skills with Terraform, CloudFormation, or Pulumi.
  • Model Serving: Hands-on experience with serving frameworks like Triton Inference Server, vLLM, Text Generation Inference (TGI), or TorchServe.
  • Cloud Platforms: Deep expertise in AWS (EC2, EKS, SageMaker) or GCP, specifically regarding GPU instance types and networking.
  • Observability: Proficiency with Prometheus, Grafana, DataDog, and tracing tools (OpenTelemetry).
  • Networking: Understanding of service mesh (Istio), load balancing, and high-performance networking (RPC, gRPC).

Nice To Haves

  • Experience with distributed compute schedulers such as Ray or Slurm.

Responsibilities

  • Platform Architecture: Design and maintain a robust, Kubernetes-based AI platform that supports distributed training and high-throughput inference serving.
  • Inference Optimization: Engineer low-latency serving solutions for LLMs and other models, optimizing engines (e.g., vLLM, TGI, Triton) to maximize throughput and minimize cost per token.
  • Compute Orchestration: Manage and scale GPU clusters in cloud (AWS) or on-premises environments, implementing efficient scheduling, auto-scaling, and spot instance management to optimize costs.
  • Operational Excellence (MLOps): Build and maintain "Infrastructure as Code" (Terraform/Ansible) and CI/CD pipelines to automate the lifecycle of model deployments and infrastructure provisioning.
  • Reliability & Observability: Implement comprehensive monitoring (Prometheus, Grafana) for GPU health, model latency, and system resource usage; lead incident response for critical AI infrastructure.
  • Developer Experience: Create tools and abstraction layers (SDKs, CLI tools) that allow data scientists to self-serve compute resources without managing underlying infrastructure.
  • Security & Compliance: Ensure all AI infrastructure meets strict security standards, handling sensitive data encryption and access controls (IAM, VPCs) effectively.

Benefits

  • Health, dental, and vision insurance
  • Retirement savings plan
  • Paid time off
  • Health savings account
  • Flexible spending accounts
  • Life insurance
  • Disability insurance
  • Tuition reimbursement