Senior Backend Engineer - Together Cloud

Together AI · San Francisco, CA
$160,000 - $230,000

About The Position

Together AI is building the AI Acceleration Cloud, an end-to-end platform for the full generative AI lifecycle that combines the fastest LLM inference engine with state-of-the-art AI cloud infrastructure. As a Senior AI Infrastructure Engineer, you will play a key role in building the next-generation AI cloud platform: a highly available, global, blazing-fast cloud infrastructure that virtualizes cutting-edge ML hardware (GB200s/GB300s, BlueField DPUs) and provides state-of-the-art ML practitioners with self-serve AI cloud services such as on-demand and managed Kubernetes and Slurm clusters. This platform serves both our internal SaaS products (inference, fine-tuning) and our external cloud customers, spanning dozens of data centers across the world.

Requirements

  • 5+ years of professional software development experience and proficiency in at least one backend programming language (Golang desired).
  • 5+ years experience writing high-performance, well-tested, production quality code.
  • Demonstrated experience building and operating high-performance and/or globally distributed microservice architectures across one or more cloud providers (AWS, Azure, GCP).
  • Excellent communication skills: able to write clear design docs and work effectively with both technical and non-technical team members.
  • Deep experience with Kubernetes internals a big plus, such as implementing non-trivial Kubernetes operators, device/storage/network plugins, custom schedulers, or patches to Kubernetes itself.
  • Deep experience with VMs/hypervisors a big plus, such as QEMU/KVM, cloud-hypervisor, VFIO, virtio, PCIe passthrough, KubeVirt, SR-IOV.
  • Deep experience with data-center networking technologies and solutions a big plus, such as VLAN, VXLAN, VPN, VPC, OVS/OVN.
  • Experience with Cluster API or similar a big plus.
  • Experience working on high-performance compute, networking, and/or storage a big plus.
  • Experience virtualizing GPUs and/or InfiniBand a big plus.
  • Strong systems knowledge across compute, networking, and storage, including concurrency, memory management, performant I/O, and scale.
  • Experience with infrastructure automation tools (Terraform, Ansible), monitoring/observability stacks (Prometheus, Grafana), and CI/CD pipelines (GitHub Actions, ArgoCD).
  • Experience building IaaS or PaaS systems at scale a plus.
  • Experience with DPUs/SmartNICs a plus.
  • GPU programming, NCCL, CUDA knowledge a plus.

Responsibilities

  • Design, build, and maintain performant, secure, and highly available backend services and operators that run in our data centers and automate hardware management, such as InfiniBand partitioning, in-DC parallel storage provisioning, and VM provisioning.
  • Design and build out the IaaS software layer for a new GB200 data center with thousands of GPUs.
  • Work on a global multi-exabyte high-performance object store, serving massive datasets for pretraining.
  • Build advanced observability stacks for our customers with automated node lifecycle management for fault-tolerant distributed pretraining.
  • Perform architecture and research work for decentralized AI workloads.
  • Work on the core, open-source Together AI platform.
  • Create services, tools, and developer documentation.
  • Create testing frameworks for robustness and fault-tolerance.

Benefits

  • Competitive compensation
  • Startup equity
  • Health insurance
  • Remote work flexibility