Principal Network Architect - AI Infrastructure

Nscale

About The Position

Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility.

We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you’ll be contributing to building the technology that powers the future.

Nscale is seeking a Principal Network Architect to lead the evolution, reliability, and operational excellence of our global AI networking infrastructure. This role sits at the core of Nscale’s platform, where network performance directly impacts AI training outcomes. You will act as a technical authority across large-scale RDMA / InfiniBand / RoCE fabrics, driving automation, availability improvements, and system-level design across a globally distributed GPU cloud. You will combine deep protocol-level networking expertise with strong software and automation skills to operate and scale one of the most demanding AI networking environments in the industry.

Requirements

10+ years of experience in network engineering in hyperscale, AI, or HPC environments
Deep expertise in RDMA, InfiniBand, and/or large-scale RoCE fabrics
Strong understanding of RDMA internals and performance tuning, congestion control and fabric failure modes, and distributed system communication patterns
Expert-level knowledge of data center networking protocols (BGP, OSPF, ECMP)
Proven ability to debug multi-layer issues across network, system, and application layers
Strong programming/scripting skills for automation (Python, Go, etc.)
Experience designing high-scale, highly available network systems
Demonstrated ability to lead complex technical programs without direct authority
Experience acting as a senior escalation point for critical production issues
Strong ability to drive cross-team alignment and execution
Systems-level thinking balancing performance, reliability, scalability, and cost

Nice To Haves

Experience with NVIDIA / Mellanox networking platforms
Familiarity with distributed AI training frameworks and GPU communication patterns
Experience building network observability systems at scale
Background influencing infrastructure strategy in high-growth environments

Responsibilities

Own the technical direction and operational lifecycle management of Nscale’s high-performance RDMA network fabrics
Define long-term architecture, reliability strategy, and operational standards for AI interconnect networks
Lead availability and performance improvement initiatives across globally distributed GPU clusters
Act as a technical authority (SME) across networking, influencing platform-wide decisions
Support the design, build, and evolution of large-scale InfiniBand and RoCE fabrics
Drive deep debugging and resolution of complex cross-layer issues (hardware, firmware, kernel, distributed workloads)
Lead incident response and postmortems, ensuring systemic fixes and long-term improvements
Define and enforce standards across congestion control and traffic engineering; routing (BGP, ECMP, fabric-level routing strategies); firmware lifecycle and change management; and network observability and telemetry
Develop and scale automation frameworks for network provisioning, validation, and operations
Build tooling to support high-reliability, low-touch network operations at scale
Improve operational efficiency across hundreds of thousands of endpoints and high-throughput links
Lead complex technical initiatives across Network, SRE, Compute, and Platform teams
Serve as technical lead on critical programs, coordinating engineers and stakeholders
Influence product and infrastructure roadmaps based on operational insights and customer needs
Mentor senior engineers and raise the bar for technical rigor and execution

Benefits

Highly competitive package (base + equity) with reviews every 12 months.
Dynamic progression plan tailored to your ambitions.
Flexible workplace that trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.
