About The Position

The ML and Data Infrastructure team builds the foundational infrastructure that powers frontier AI models and truth-seeking agents: petabyte-scale data acquisition and multimodal crawling, web-scale search/retrieval systems, reliable high-throughput inference serving, low-level GPU/kernel optimizations, compiler/runtime innovations, and high-speed interconnect fabrics for massive clusters. In this role, you will collaborate across the pre-training, multimodal, reasoning, and product teams in a fast-paced, meritocratic environment, tackling ambiguous, high-stakes problems with first-principles thinking and rigorous execution.

Requirements

  • Strong systems engineering skills with proven impact on large-scale distributed infrastructure (data processing, search, inference, or cluster networking).
  • Proficiency in Python and at least one compiled language (Rust, C++, Go, Java); experience building bespoke libraries, optimizing performance, and debugging complex systems.
  • Hands-on experience in at least one key area: petabyte-scale data pipelines/crawling (Spark/Ray/Kubernetes), web-scale search/retrieval (vector DBs, ranking, RAG), inference optimization (SGLang, kernels, batching), compiler features (JAX/XLA), or high-speed interconnects (optical/copper, SerDes, signal integrity).
  • Deep understanding of distributed systems challenges: high-throughput operations, latency/throughput tradeoffs, fault tolerance, monitoring, and scaling production systems to billions of users or 100k+ GPUs.
  • Passion for AI infrastructure: keeping up with state-of-the-art techniques, first-principles problem-solving, meticulous organization and bookkeeping, and delivering rigorous, high-quality results.

Nice To Haves

  • Experience with multimodal data (images/video/audio), epistemics/truth-seeking in retrieval, or agentic systems (long-horizon reasoning, feedback loops).
  • Low-level optimizations: CUDA kernel development (Tensor cores, attention), GPU profiling (Nsight), low-precision numerics, or interconnect pathfinding (LPO/LRO/CPO, photonics).
  • Production expertise in inference reliability (0% error target), CI/CD for ML, or cluster networking (topology, vendor collaboration, failure root-cause).
  • Track record of owning end-to-end projects in hyperscale environments, with strong debugging skills, vendor management experience, or open-source contributions (e.g., SGLang).

Responsibilities

  • Design, build, and operate petabyte-to-exabyte scale distributed systems for data acquisition, web crawling, preprocessing, filtering/classification, and multimodal pipelines (CPU/GPU workloads).
  • Architect high-performance search/retrieval engines (vector/hybrid/semantic) at trillion-document scale, integrating with LLMs/agents for truth-seeking, low-hallucination reasoning, and real-time knowledge access.
  • Develop reliable inference serving infrastructure: load balancing, autoscaling, KV cache, batching, fault-tolerance, monitoring (Prometheus/Grafana), CI/CD (Buildkite/ArgoCD), and benchmarking for 100% uptime and optimal tail latency.
  • Optimize low-level performance: CUDA kernels (GeMM, attention), Triton/CUTLASS extensions, quantization/distillation/speculative decoding, GPU memory hierarchy, and model-hardware co-design for next-gen architectures.
  • Innovate on compilers/runtimes (JAX/XLA/MLIR, custom features for Hopper/Blackwell), distributed profiling/debugging tools, and interconnect fabrics (copper/optical, 1.6T+, SerDes/photonics, topology simulation, vendor roadmaps).
  • Manage complex workloads across clouds/clusters: orchestration (Kubernetes), data bookkeeping/verifiability, high-speed interconnect validation, failure analysis, and telemetry/automation for production reliability.

Benefits

  • Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.