Cluster Infrastructure Engineer

Cartesia · San Francisco, CA
Onsite

About The Position

We’re looking for a Cluster Infrastructure Engineer to help build and scale the compute backbone that powers Cartesia’s research on real-time, multimodal intelligence. In this role, you’ll work at the intersection of distributed systems and infrastructure engineering, designing and operating the large-scale GPU clusters that train and serve Cartesia’s foundation models. You’ll own systems that need to be fast, reliable, and highly automated — ensuring our researchers and product teams can move at the speed of innovation. You’ll build the tooling, automation, and monitoring needed to keep clusters resilient under load, quickly diagnose and resolve issues, and continuously push the boundaries of scalability and efficiency.

Requirements

  • Strong engineering fundamentals and experience building and operating large-scale distributed systems
  • Deep familiarity with HPC & GPU cluster management using Kubernetes and Slurm
  • A blend of developer empathy and performance engineering, designing systems and tools that are both intuitive to use and fast
  • Ability to balance principled engineering with the urgency of keeping mission-critical systems alive
  • Proficiency with Infrastructure-as-Code tools (Terraform, Ansible, etc.) and observability tools (Prometheus, Grafana, etc.)
  • Strong debugging skills, comfortable diagnosing NCCL issues, CUDA errors, and network or driver-level faults

Nice To Haves

  • Experience optimizing large-scale distributed training frameworks such as DeepSpeed, Megatron-LM, or similar
  • Familiarity with advanced parallelization techniques such as FSDP, context parallelism, or tensor parallelism

Responsibilities

  • Design and build large-scale GPU clusters for model training and low-latency inference
  • Develop automation for provisioning, scaling, and monitoring to ensure clusters are fast, resilient, and self-healing
  • Collaborate closely with research and product teams to enable distributed training at scale, optimizing for speed, reliability, and utilization
  • Implement robust observability and alerting systems to monitor GPU health, node stability, and job performance
  • Diagnose and triage hardware, networking, and distributed training issues across environments, coordinating with provider support as needed
  • Continuously improve cluster reliability, developer ergonomics, and overall system efficiency across Cartesia’s research and production workloads