System Software Engineer, Distributed Systems

NVIDIA, Santa Clara, CA
$152,000 - $287,500

About The Position

The VLSI Productivity and Infrastructure team supports 1000+ chip design engineers by building tools and platforms that supercharge their everyday work. Our mission: make chip designers faster. We build and operate long shelf-life systems spanning build automation, observability, analytics, automated error detection/remediation, and codebase modernization, with a strong commitment to stability.

Our core workflow infrastructure runs as userspace software on bare-metal Linux hosts (no sudo, no containers). We coordinate shared state and artifacts via NFS, launch long-running, compute-heavy workflows on IBM LSF, and provide adjacent services for APIs and observability. This is a high-ownership environment where you'll often be the expert on what you build.

We are looking for a pragmatic and versatile systems engineer who enjoys working near the metal and building tools that empower other engineers. This is a generalist role with an emphasis on distributed systems and operational excellence in a “below containers” world: coordination, reliability, performance, and safe evolution of legacy systems (including incremental modernization of large codebases into Go). This isn't a CI/CD pipeline configuration role; you will be writing the userspace software that manages state, concurrency, and reliability at scale.

Requirements

  • B.S. CS/EE (or equivalent experience)
  • 5+ years developing and operating production software in Go and/or Python, ideally in large codebases
  • Strong Linux fundamentals: processes, filesystems, permissions, synchronization/locks, concurrency, and debugging
  • Solid distributed-systems thinking: failures, retries/timeouts, backoff, idempotency, and operational rigor
  • Experience building long-runtime automation or services on shared compute clusters (batch schedulers, build systems)
  • Ability to translate ambitious, high-level goals into a safe delivery plan (instrumentation, staged rollout, measurable outcomes)

Nice To Haves

  • Hands-on experience with shared filesystems at scale (NFS), or coordination patterns on eventually-consistent storage
  • Experience with batch job scheduling, shared compute fleets, or build systems
  • Track record of incremental modernization (tests, shadow runs, canaries, rollback plans)
  • Experience partitioning/optimizing metadata-heavy systems and reducing I/O or R/W hot spots
  • Strong incident/debug tactics: clear root-cause analysis, remediation, and guardrails
  • Rapid comprehension and ownership of unfamiliar codebases in any language (including LLM-generated code) to implement high-leverage changes

Responsibilities

  • Design, build, and deliver core components of our next-generation productivity platforms
  • Develop reliable userspace infrastructure for long-running engineering workflows at scale on bare-metal Linux hosts
  • Build state coordination over NFS (atomicity, idempotency/dedup, partial-write recovery, without privileged ops)
  • Build and improve orchestration around IBM LSF (submission/tracking, retries/cancel, log capture, fairness/backpressure)
  • Convert legacy codebases into modern powerhouses using incremental migration techniques (e.g., Perl to Go), with stage gates, parity strategies, and strong observability
  • Debug and improve performance and reliability across Linux and Kubernetes, including operational tooling
  • Collaborate with engineering users to turn ambiguous workflows into durable production systems

Benefits

  • Competitive salaries
  • Generous benefits package
  • Equity