Senior AI Infrastructure Engineer

WEX | Boston, MA (Remote)

About The Position

We are the backbone of the AI organization, building the high-performance compute foundation that powers our generative AI and machine learning initiatives. Our team bridges the gap between hardware and software, ensuring that our researchers and data scientists have a reliable, scalable, and efficient platform to train and deploy models. We focus on maximizing GPU utilization, minimizing inference latency, and creating a seamless "paved road" for AI development.

You are a systems thinker who loves solving hard infrastructure challenges. You will architect the underlying platform that serves our production AI workloads, ensuring they are resilient, secure, and cost-effective. By optimizing our compute layer and deployment pipelines, you will directly accelerate the velocity of the entire AI product team, transforming how we deliver AI at scale.

Requirements

  • 5+ years of experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering, with at least 2 years focused on Machine Learning infrastructure.
  • Production Expertise: Proven track record of managing large-scale production clusters (Kubernetes) and distributed systems.
  • Hardware Fluency: Deep understanding of GPU architectures (NVIDIA A100/H100), CUDA drivers, and networking requirements for distributed workloads.
  • Serving Proficiency: Experience deploying and scaling open-source LLMs and embedding models using containerized solutions.
  • Automation First: Strong belief in "Everything as Code"—you automate toil wherever possible using Python, Go, or Bash.
  • Core Engineering: Expert proficiency in Python and Go; comfortable digging into lower-level system performance.
  • Orchestration & Containers: Mastery of Kubernetes (EKS/GKE), Helm, Docker, and container runtimes.
  • Infrastructure as Code: Advanced skills with Terraform, CloudFormation, or Pulumi.
  • Model Serving: Hands-on experience with serving frameworks like Triton Inference Server, vLLM, Text Generation Inference (TGI), or TorchServe.
  • Cloud Platforms: Deep expertise in AWS (EC2, EKS, SageMaker) or GCP, specifically regarding GPU instance types and networking.
  • Observability: Proficiency with Prometheus, Grafana, DataDog, and tracing tools (OpenTelemetry).
  • Networking: Understanding of service mesh (Istio), load balancing, and high-performance networking (RPC, gRPC).

Nice To Haves

  • Experience with distributed compute schedulers such as Ray or Slurm.

Responsibilities

  • Platform Architecture: Design and maintain a robust, Kubernetes-based AI platform that supports distributed training and high-throughput inference serving.
  • Inference Optimization: Engineer low-latency serving solutions for LLMs and other models, optimizing engines (e.g., vLLM, TGI, Triton) to maximize throughput and minimize cost per token.
  • Compute Orchestration: Manage and scale GPU clusters in cloud (AWS) or on-premises environments, implementing efficient scheduling, auto-scaling, and spot instance management to optimize costs.
  • Operational Excellence (MLOps): Build and maintain "Infrastructure as Code" (Terraform/Ansible) and CI/CD pipelines to automate the lifecycle of model deployments and infrastructure provisioning.
  • Reliability & Observability: Implement comprehensive monitoring (Prometheus, Grafana) for GPU health, model latency, and system resource usage; lead incident response for critical AI infrastructure.
  • Developer Experience: Create tools and abstraction layers (SDKs, CLI tools) that allow data scientists to self-serve compute resources without managing underlying infrastructure.
  • Security & Compliance: Ensure all AI infrastructure meets strict security standards, handling sensitive data encryption and access controls (IAM, VPCs) effectively.

Benefits

  • Health, dental, and vision insurance
  • Retirement savings plan
  • Paid time off
  • Health savings account
  • Flexible spending accounts
  • Life insurance
  • Disability insurance
  • Tuition reimbursement