Site Reliability Engineer

Nscale
$100,000 - $170,000
Hybrid

About The Position

About Nscale

Nscale is the GPU cloud engineered for AI, purpose-built to deliver high-performance, cost-efficient infrastructure for AI-native startups and global enterprises. We enable organizations to accelerate innovation, reduce the complexity of AI development, and achieve meaningful business outcomes through scalable, sustainable compute. Our culture is defined by ownership, accountability, and rapid innovation. We operate with urgency and transparency, and every team member contributes to building the infrastructure powering the future of AI.

Requirements

  • 2–5 years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering in a data center environment
  • 2+ years of programming experience (e.g., Python, Go, or similar), with an interest in automation and tooling
  • Working knowledge of Linux systems, networking concepts, and distributed systems
  • Experience troubleshooting system or application issues in production environments
  • Familiarity with monitoring or observability tools (e.g., logs, metrics, dashboards)
  • Strong willingness to learn and improve reliability and operational practices
  • Ability to work in fast-paced environments and collaborate across teams

Nice To Haves

  • Exposure to cloud platforms, Kubernetes, or virtualized/bare-metal environments
  • Experience in AI, GPU workloads, or high-performance computing (HPC)
  • Basic understanding of high-performance networking concepts (e.g., InfiniBand, RDMA)
  • Exposure to production monitoring or alerting systems at small or medium scale

Responsibilities

  • Help build and improve automation, tooling, and infrastructure that supports AI workloads
  • Support the development of operational systems and platform services
  • Assist in defining and maintaining basic SLOs/SLIs and monitoring dashboards
  • Participate in incident response, troubleshooting, and post-incident reviews
  • Investigate and help resolve performance and reliability issues across systems
  • Collaborate with Engineering, Networking, and Infrastructure teams to improve system stability
  • Contribute to improving availability, scalability, and operational efficiency
  • Learn from senior engineers and grow your expertise in reliability engineering

Benefits

  • Highly competitive package (base + equity) with reviews every 12 months.
  • Dynamic progression plan tailored to your ambitions.
  • A flexible workplace that trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.
  • Medical, dental, vision, flexible paid time off, parental leave, and retirement plan participation.