Staff HPC Systems Software Engineer

Nscale•New York, NY

$225,000 - $275,000•Remote

About The Position

Nscale is seeking a Staff HPC Systems Software Engineer to define the technical direction and evolution of a core HPC platform domain. This role operates beyond a single team, shaping how multiple teams build, automate, and run Slurm-based capabilities within Nscale’s wider cloud-native platform. The engineer will work across engineering boundaries to bring coherence to architecture, interfaces, lifecycle models, and operational approaches, partnering closely with teams working on platform tooling, infrastructure APIs, identity systems, and Kubernetes-adjacent systems. This is a high-impact staff-level role for someone who combines deep hands-on software engineering with strong systems judgment. The work will ensure Nscale’s HPC services are robust, supportable, and maintainable, while creating leverage through shared patterns, reusable implementations, and clear technical direction across ambiguous, business-critical problem spaces.

Requirements

Extensive experience designing and building production software and automation for HPC systems, especially Slurm-based environments.
Strong track record of writing maintainable, testable, and resilient software in Go, Python, or similar languages.
Proven ability to define technical direction across a domain spanning multiple teams or services.
Strong understanding of Slurm internals, scheduler behaviour, cluster lifecycle concerns, and operational trade-offs.
Strong practical understanding of GPU-backed infrastructure and HPC networking, including InfiniBand, RoCE, RDMA, and performance-sensitive workload characteristics.
Experience integrating HPC systems with cloud-native platforms, APIs, or service delivery models.
Experience creating engineering leverage through standards, reusable patterns, shared tooling, and architectural clarity.
Strong judgement in balancing short-term delivery with long-term platform health and supportability.
Strong written and verbal communication skills, with the ability to align multiple teams around a coherent technical direction.

Nice To Haves

Experience with other schedulers or batch systems, such as Kueue.

Responsibilities

Own and evolve the technical direction for a defined HPC systems domain, such as Slurm platform architecture, scheduler integrations, cluster lifecycle, workload environments, or service automation.
Make architectural decisions that balance software quality, operational realities, customer needs, and long-term maintainability.
Define how proven Slurm implementations should be packaged, automated, and exposed as a service.
Resolve ambiguity around ownership, interfaces, lifecycle boundaries, and operating models across teams.
Act as the technical escalation point for the most complex issues within the domain.
Establish shared patterns and standards for automation, service lifecycle management, observability, reliability, and supportability across the HPC platform.
Drive cross-team design for integrations between Slurm, Kubernetes-adjacent systems, infrastructure APIs, identity systems, and platform tooling.
Create reusable modules, automation, deployment patterns, and reference implementations that increase engineering leverage.
Identify and correct avoidable technical divergence, duplicated effort, and fragile operating models.
Ensure domain designs reflect the realities of GPU scheduling, HPC networking, performance isolation, and production operations.
Lead technically critical initiatives spanning 2–4 teams or a defined HPC platform area.
Unblock delivery by clarifying technical direction and reducing ambiguity in complex system design problems.
Contribute hands-on where needed to de-risk or accelerate critical work.
Influence engineering teams without formal authority through strong judgement, design clarity, and practical solutions.
Partner with adjacent cloud-native software engineers so HPC implementations build on shared platform patterns rather than diverging into separate ones.