Software Engineer, Workload Enablement

OpenAI, San Francisco, CA

About The Position

We’re hiring a Software Engineer to enable production workloads and end-to-end testing on new platforms. The role includes building new test harnesses and platform stress benchmarks; porting existing inference and training workloads to new, sometimes early-access, systems and hardware; analyzing performance and bottlenecks; and characterizing the end-to-end behavior of new systems (compute, comms, storage, control plane, and failure modes).

Requirements

  • BS in CS/EE (or equivalent practical experience).
  • 5+ years in one or more of: ML systems, performance engineering, distributed systems, or HPC.
  • Strong hands-on experience with:
      ◦ PyTorch and modern LLM training/inference stacks
      ◦ Large-scale distributed training concepts (data/model/pipeline parallel, collective comms)
      ◦ RDMA and debugging/optimizing comms libraries (NCCL or RCCL) and their interaction with hardware/network (see the all-reduce timing sketch after this list)
  • Proficiency in Python plus comfort reading/writing performance-critical code (C++/CUDA/HIP is a plus).
  • Strong profiling/debugging skills (e.g., Nsight, rocprof, perf, flamegraphs; ability to reason from traces/counters).
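
As a loose illustration of the collective-comms work referenced above, here is a minimal sketch of timing an all-reduce with torch.distributed. The backend choice, payload size, iteration counts, and torchrun launch line are assumptions for the example, not a description of internal tooling.

    # Hypothetical microbenchmark: time an all-reduce via torch.distributed.
    # Assumes one GPU per rank and a NCCL backend (RCCL on ROCm builds).
    # Example launch: torchrun --nproc_per_node=8 allreduce_bench.py
    import os
    import time

    import torch
    import torch.distributed as dist

    def main():
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # set by torchrun

        # 256 MiB fp16 payload, roughly gradient-bucket sized (illustrative);
        # zeros keep the summed values from overflowing fp16 across iterations.
        tensor = torch.zeros(128 * 1024 * 1024, dtype=torch.float16, device="cuda")

        for _ in range(5):               # warm-up hides one-time setup cost
            dist.all_reduce(tensor)
        torch.cuda.synchronize()

        iters = 20
        start = time.perf_counter()
        for _ in range(iters):
            dist.all_reduce(tensor)
        torch.cuda.synchronize()         # collectives are async; wait before timing
        elapsed = (time.perf_counter() - start) / iters

        # Ring all-reduce bus bandwidth: 2 * (n - 1) / n * bytes / seconds
        n = dist.get_world_size()
        size = tensor.numel() * tensor.element_size()
        busbw = 2 * (n - 1) / n * size / elapsed / 1e9
        if dist.get_rank() == 0:
            print(f"all_reduce: {elapsed * 1e3:.2f} ms/iter, ~{busbw:.1f} GB/s busbw")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Comparing the printed bus bandwidth against the link's rated bandwidth is a quick first check when triaging a slow collective.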

Nice To Haves

  • Experience building workload-shaped benchmarks and stress/fault tests that correlate to production behavior (not just synthetic loops or microbenchmarks).
  • Familiarity with RDMA networking and transport tuning; understanding of how network topology and congestion impact collectives.
  • Experience running and validating workloads in Kubernetes, and bridging “research code” into robust, repeatable infrastructure.
  • Hands-on lab experience with early hardware (new NICs, new GPUs/accelerators, early racks).

Responsibilities

  • Port and validate key inference and training workloads on new platforms/SKUs as they arrive; drive correctness, performance, and stability to an internal readiness bar.
  • Build a suite of benchmarks and stress tests that capture the real end-to-end behavior of our workloads by exercising every aspect of a system: CPU, GPU, memory subsystem, frontend, scale-up and scale-out networking (including WAN traffic, NVLink, and RDMA collectives), storage, thermals, and any other relevant parts.
  • Deep-dive performance on distributed training/inference:
      ◦ Collective performance and tuning (across NCCL/RCCL and internal libraries)
      ◦ Overlap of compute/communication, kernel-level bottlenecks, memory bandwidth, and scheduling effects
  • Create repeatable test harnesses that run in CI and lab environments and produce actionable outputs (pass/fail, performance score, regression detection); see the harness sketch after this list.
  • Partner with systems and fleet bring-up engineers to ensure the platform is not only stable and performant, but also operationally usable and scalable (containerization, K8s integration, telemetry hooks, failure triage loops).
  • Work cross-functionally with vendors and internal stakeholders by producing clear bug reports, minimal repros, and prioritized issue lists.
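
As a rough illustration of the harness shape described in the responsibilities above, here is a minimal sketch of a benchmark runner that emits pass/fail against stored baselines; the baselines.json file, the 5% threshold, and the stand-in workload are hypothetical choices for the example.

    # Hypothetical sketch: a repeatable benchmark runner with pass/fail output
    # and baseline-based regression detection. The baseline file, threshold,
    # and stand-in workload below are illustrative assumptions.
    import json
    import statistics
    import sys
    import time
    from pathlib import Path

    BASELINE = Path("baselines.json")  # assumed checked-in reference timings
    THRESHOLD = 0.05                   # flag anything >5% slower than baseline

    def time_median(fn, iters=10):
        """Run fn() several times and return the median seconds per run."""
        samples = []
        for _ in range(iters):
            start = time.perf_counter()
            fn()
            samples.append(time.perf_counter() - start)
        return statistics.median(samples)

    def check(name, fn):
        measured = time_median(fn)
        baselines = json.loads(BASELINE.read_text()) if BASELINE.exists() else {}
        ref = baselines.get(name)
        regressed = ref is not None and measured > ref * (1 + THRESHOLD)
        status = "FAIL" if regressed else "PASS"
        ref_txt = "n/a" if ref is None else f"{ref * 1e3:.2f} ms"
        print(f"{status} {name}: {measured * 1e3:.2f} ms (baseline: {ref_txt})")
        return not regressed

    if __name__ == "__main__":
        # Stand-in CPU workload; a real harness would drive ported model code.
        ok = check("list_squares", lambda: [i * i for i in range(1_000_000)])
        sys.exit(0 if ok else 1)  # nonzero exit lets CI gate on regressions

In CI, the nonzero exit code makes the regression gate mechanical, and updating baselines.json becomes an explicit, reviewable change.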