Staff Engineer

DigitalOcean
Seattle, WA
Hybrid

About The Position

Dive in and do the best work of your career at DigitalOcean. Journey alongside a strong community of top talent who are relentless in their drive to build the simplest scalable cloud. If you have a growth mindset, naturally like to think big and bold, and are energized by the fast-paced environment of a true industry disruptor, you’ll find your place here. We value winning together while learning, having fun, and making a profound difference for the dreamers and builders in the world.

We are seeking a Staff AI Orchestration Engineer to lead the design, optimization, and scaling of our Kubernetes-based AI infrastructure. In this role, you will tackle the unique challenges of massive-scale AI workloads, focusing on throughput, GPU utilization, and fault tolerance to support next-generation distributed training and disaggregated inference.

Requirements

  • Kubernetes Expertise: Deep technical knowledge of Kubernetes core components, API performance optimization, Dynamic Resource Allocation (DRA), and the custom resource definitions (CRDs) required for advanced scheduling.
  • Advanced Scheduling Experience: Proven track record working with AI-specific Kubernetes schedulers and orchestrators such as Kueue, Volcano, Apache YuniKorn, or Run:ai / KAI-Scheduler.
  • Hardware & Topology Acumen: Deep understanding of GPU architectures (NVIDIA and AMD) and interconnects, including how hardware topology directly affects training and inference speeds.
  • Resource Management Skills: Experience balancing performance and cost using Dominant Resource Fairness (DRF), load-aware scheduling, and bin-packing vs. spread strategies to maximize either node vacancy or the resources available to each workload.
  • Systems Isolation Background: Familiarity with container runtime internals (containerd, runc), rootless containers, and security contexts to limit the blast radius of failures in shared AI infrastructure.
  • AI/ML Framework Knowledge: Strong understanding of modern LLM serving architectures, prefill-decode disaggregation, and engines like vLLM, Triton, or SGLang.
  • Observability Proficiency: Experience tracking deep infrastructure and inference metrics, including Time To First Token (TTFT), Time Per Output Token (TPOT), and GPU memory pressure, and identifying hardware failures such as XID errors.

Responsibilities

  • Architect Large-Scale Scheduling: Design and optimize hierarchical, high-throughput scheduling architectures for massive Kubernetes clusters (1,000+ nodes, 10,000+ pods), utilizing techniques like optimistic concurrency, multi-scheduler architectures, and batch dispatching.
  • Maximize GPU Utilization: Eliminate GPU waste in multi-tenant environments by implementing fractional GPU allocation, leveraging mechanisms like KAI-Scheduler's Reservation Pods or hard-isolation tools like HAMi, and configuring time-based fairshare scheduling to balance over-quota pool access.
  • Optimize Placement & Topology: Deploy topology-aware scheduling to align pod placement with physical hardware dimensions, such as NVLink connections, PCIe lanes, and NUMA nodes, minimizing communication latency for multi-GPU operations.
  • Enhance Cluster Performance: Reduce scheduling latency and API server load by tuning etcd, optimizing admission webhooks, and implementing in-place pod resizing (VPA) or in-place container restarts.
  • Secure AI Workloads: Design secure, multi-layered isolation environments and Agent Sandboxes to safely execute untrusted LLM-generated code, utilizing namespaces, Kata Containers, gVisor, or Firecracker microVMs.
  • Manage AI Storage & Fault Tolerance: Orchestrate efficient model weight distribution using OCI Image Volumes and implement Checkpoint/Restore capabilities (via CRIU and NVIDIA cuda-checkpoint) for long-running training fault recovery.
  • Enable Distributed Training: Implement robust gang scheduling to prevent deadlocks in tightly-coupled, multi-node training jobs (e.g., MPI, PyTorch) using tools like Volcano, Kueue, or LeaderWorkerSet (LWS).
  • Orchestrate Complex Inference: Implement and manage disaggregated AI inference pipelines using frameworks like NVIDIA Grove, coordinating multicomponent deployments (e.g., prefill leaders, decode workers, KV routers) with multilevel autoscaling and explicit startup ordering.

Benefits

  • Employee Assistance Program
  • Local Employee Meetups
  • Flexible time off policy
  • Reimbursement for relevant conferences, training, and education
  • Access to LinkedIn Learning's 10,000+ courses
  • Bonus
  • Equity compensation
  • Equity grants upon hire
  • Employee Stock Purchase Program