Senior Systems Engineer- Network Infrastructure

Nscale

5d•Onsite

About The Position

We are building next-generation AI infrastructure from the ground up. Our mission is to deliver highly performant, reliable, and scalable network clusters purpose-built for large-scale AI training and inference. As a startup, we operate with urgency, ownership, and a bias toward action. We are assembling the foundational infrastructure that will power frontier AI workloads—and we’re looking for engineers who want to build it from zero to scale.

Requirements

5–8+ years in infrastructure engineering, hardware deployment, or data center operations.
Hands-on experience deploying network servers (HGX/DGX or similar platforms).
Experience with high-speed networking (InfiniBand, RoCE, Ethernet fabrics).
Strong Linux systems knowledge.
Experience troubleshooting distributed systems performance issues.
Comfortable working onsite in data center environments as needed.

Nice To Haves

Experience in AI/ML infrastructure or HPC environments.
Familiarity with NCCL, CUDA, RDMA.
Automation experience (Python, Ansible, Terraform, Bash).
Experience in high-density power and cooling environments.

Responsibilities

Execute end-to-end bringup of network nodes and racks from installation to production readiness.
Validate BIOS/BMC/firmware configurations and network health.
Perform rack-level integration including power, cabling, and airflow validation.
Bring up and validate high-speed network fabrics (InfiniBand, RoCE, 100–400G Ethernet).
Configure and validate leaf/spine network connectivity.
Run cluster-wide burn-in and stress testing.
Validate node-to-node performance (NCCL, RDMA, GPUDirect).
Troubleshoot hardware, firmware, and fabric-level issues.
Contribute to automation for provisioning and cluster validation.
Improve deployment playbooks and documentation.
Identify reliability issues early and drive corrective actions.
Help turn ad hoc deployments into repeatable systems.
Work closely with networking, systems software, and data center teams.
Coordinate with hardware vendors to resolve bringup issues.
Support rapid capacity expansion as we scale.