Principal Site Reliability Engineer - AI Infrastructure Operations

Nscale
Seattle, WA
$150,000 - $2,150,000
Remote

About The Position

Nscale is seeking a Principal Site Reliability Engineer (SRE) to provide technical leadership within its AI Infrastructure Operations domain. This senior role focuses on establishing reliability strategy, developing foundational systems, and driving improvements across the organization. The Principal SRE will act as a technical authority on reliability, automation, and operational architecture for Nscale’s GPU, network, and control-plane platforms. The AI Infrastructure Operations team is responsible for the reliability and scalability of demanding AI platforms, and it values engineers who think in systems, lead through influence, and promote operational excellence.

Requirements

  • 10+ years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering roles operating complex, large-scale infrastructure
  • Expert-level software engineering skills, with a strong track record of building production-grade automation and systems
  • Deep expertise in Linux, networking, and distributed systems design at scale
  • Extensive experience debugging and resolving failures across hardware, OS, networking, and application layers
  • Proven ability to lead technical initiatives across teams without direct authority
  • Strong systems-thinking mindset, with the ability to balance reliability, velocity, and cost

Nice To Haves

  • Deep hands-on experience with AI or HPC platforms, including GPUs, high-speed interconnects (InfiniBand/RDMA), and workload schedulers (e.g. SLURM)
  • Experience designing observability systems for high-cardinality, high-throughput environments
  • Familiarity with Kubernetes at scale and hybrid or bare-metal cloud architectures
  • A history of driving step-change improvements in reliability, scalability, or operational efficiency

Responsibilities

  • Owning and evolving the long-term reliability strategy for Nscale’s AI and HPC infrastructure
  • Designing and leading the development of large-scale control-plane systems, automation frameworks, and operational tooling
  • Defining reliability standards, SLO frameworks, and operational best practices used across multiple teams
  • Acting as a senior technical escalation point during critical incidents, guiding resolution and ensuring systemic fixes
  • Identifying structural reliability risks and driving cross-functional initiatives to address them at the architectural level
  • Partnering with Engineering, Network Operations, and Fleet Operations leadership to influence platform design and operational maturity
  • Mentoring senior and mid-level engineers, raising the overall quality and effectiveness of SRE practices
  • Driving measurable improvements in availability, MTTR, cost efficiency, and operational scalability

Benefits

  • Highly competitive package (base + equity) with reviews every 12 months
  • Flexible workplace
  • Remote-first team
  • Medical
  • Dental
  • Vision
  • Flexible paid time off
  • Parental leave
  • Retirement plan participation