Site Reliability Engineer

Nscale

$100,000 - $170,000 • Hybrid

About The Position

About Nscale

Nscale is the GPU cloud engineered for AI, purpose-built to deliver high-performance, cost-efficient infrastructure for AI-native startups and global enterprises. We enable organizations to accelerate innovation, reduce the complexity of AI development, and achieve meaningful business outcomes through scalable, sustainable compute. Our culture is defined by ownership, accountability, and rapid innovation. We operate with urgency and transparency, and every team member contributes to building the infrastructure powering the future of AI.

Requirements

2–5 years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering in a data center environment
2+ years of programming experience (e.g., Python, Go, or similar), with an interest in automation and tooling
Working knowledge of Linux systems, networking concepts, and distributed systems
Experience troubleshooting system or application issues in production environments
Familiarity with monitoring or observability tools (e.g., logs, metrics, dashboards)
Strong willingness to learn and improve reliability and operational practices
Ability to work in fast-paced environments and collaborate across teams

Nice To Haves

Exposure to cloud platforms, Kubernetes, or virtualized/bare-metal environments
Experience in AI, GPU workloads, or high-performance computing (HPC)
Basic understanding of high-performance networking concepts (e.g., InfiniBand, RDMA)
Exposure to production monitoring or alerting systems at small or medium scale

Responsibilities

Help build and improve automation, tooling, and infrastructure that supports AI workloads
Support the development of operational systems and platform services
Assist in defining and maintaining basic SLOs/SLIs and monitoring dashboards
Participate in incident response, troubleshooting, and post-incident reviews
Investigate and help resolve performance and reliability issues across systems
Collaborate with Engineering, Networking, and Infrastructure teams to improve system stability
Contribute to improving availability, scalability, and operational efficiency
Learn from senior engineers and grow your expertise in reliability engineering

Benefits

Highly competitive package (base + equity) with reviews every 12 months.
Dynamic progression plan tailored to your ambitions.
Flexible workplace that trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.
Medical, dental, vision, flexible paid time off, parental leave, and retirement plan participation.
