About The Position

RadixArk is hiring a Member of Technical Staff, CI Engineer, to own the infrastructure that keeps SGLang moving. Our CI system runs 300+ GPU tests across NVIDIA, AMD, Intel, and Ascend hardware pools, gating every commit to one of the fastest-growing open-source LLM inference engines. When CI is green and fast, 100+ contributors ship with confidence. When it isn't, the entire project stalls. That bottleneck is your problem to solve.

You won't just maintain pipelines; you'll architect them. You'll replace brittle static thresholds with regression-based detection, harden runners against supply-chain attacks from fork PRs, and cut cycle times so contributors get feedback in minutes, not hours. You'll work directly with core maintainers, hardware partners, and the open-source community to keep the system that gates every merge trustworthy, fast, and secure.

This is not a role for someone who wants to write CI YAML and walk away. It's for an engineer who treats CI infrastructure the way we treat serving infrastructure: as a system worth designing well.

Requirements

  • 3+ years operating CI/CD at scale (GitHub Actions, Buildkite, Jenkins, GitLab CI, or similar)
  • Deep knowledge of Linux, Docker, and GPU computing
  • Self-hosted runner management experience
  • Strong Bash and Python
  • Security mindset — CI supply chain risks, fork PR attack vectors, runner hardening
  • NVIDIA GPU drivers, CUDA, NCCL, InfiniBand/RDMA experience in CI contexts
  • Familiarity with ML inference workloads (model loading, KV cache, quantization)

Nice To Haves

  • Large open-source project CI experience (100+ contributors)
  • AMD ROCm or Intel XPU CI pipelines

Responsibilities

  • Own CI reliability end-to-end — triage failures, distinguish real regressions from flaky tests and infra issues, keep main green
  • Build regression-based CI — replace hardcoded static thresholds with automated baseline comparison (metrics pipeline, durable storage, detection logic)
  • Harden runner infrastructure — ephemeral runners, container isolation, security hardening for fork PR execution
  • Cut CI time — right-size eval suites, deduplicate server startups, separate PR smoke tests from nightly full runs
  • Improve developer experience — faster feedback, clearer failure messages, workflow orchestration

Benefits

  • We offer a competitive base salary with meaningful equity, comprehensive health benefits, and flexible work arrangements.
  • Compensation is determined by location, level, and experience.