Staff AI Infrastructure Engineer

Nuclearn•Phoenix, AZ

Hybrid

About The Position

Nuclearn.ai builds AI-powered software for the nuclear and utility industries: tools that keep critical infrastructure reliable, efficient, and safe. Our software integrates AI-driven workflow, documentation, and research automation, and is already used at 60+ nuclear reactors across North America. You'll ship production infrastructure that operators and engineers rely on every day. We're growing quickly, expanding our team and our Phoenix AI data center. The work is consequential: the infrastructure you build and maintain is the foundation everything else runs on.

Eligibility: U.S. citizenship or permanent residency (green card) is required due to DOE export compliance.

What You'll Do

This is a hands-on infrastructure role. You will physically build, operate, and scale the GPU compute environment that powers our AI platform, not design it from a desk.

Build and operate our Phoenix AI data center: rack and cable GPU servers, configure power distribution, manage cooling and airflow, maintain redundancy, and handle firmware and hardware lifecycle. You own uptime.
Plan and execute infrastructure scaling: spec and procure hardware. Run capacity planning against real workload data. Execute GPU refreshes, storage expansions, and network upgrades with minimal disruption to production.
Own the full stack from power to container: configure bare-metal servers, IPMI/BMC management, OS provisioning, networking (switches, VLANs, cabling), storage, and container runtimes. Troubleshoot across the entire hardware-software boundary.
Partner with utility IT teams on customer deployments: review and validate customer-proposed infrastructure for hosting Nuclearn applications. Identify GPU/runtime mismatches, networking gaps, and configuration issues before go-live. Provide concrete remediation guidance.

You will operate as a senior individual contributor with high autonomy and direct influence across engineering, ML, product, and customer environments.

Examples of problems you might own in your first 90 days

Rack and commission a new GPU node: receive hardware, plan rack placement for power and thermal constraints, install rails, cable power and networking, configure BMC, provision the OS, validate GPU functionality, and hand off a production-ready machine to the ML team.
Develop a hardware requirements standard for both internal and customer-facing deployments: GPU sizing models, storage thresholds, power and cooling requirements, networking specs, and supported configurations.
Audit the Phoenix data center end-to-end: map current power draw against capacity, identify thermal hotspots, assess cable management, review redundancy gaps, and execute targeted upgrades to keep pace with scaling workloads.
Validate a utility customer's proposed infrastructure before deployment: catch a GPU/driver mismatch, flag insufficient network throughput, or identify a cooling limitation that would throttle inference performance under load.

Requirements

You've racked servers and managed physical infrastructure — not just in a lab, but in production environments where uptime matters
Hands-on experience with NVIDIA GPU hardware: installation, driver and firmware management
Strong Linux systems administration (bare metal, not just cloud VMs)
Experience with Ceph storage clusters: deployment, tuning, and operations
Experience with Proxmox virtualization for managing compute and storage infrastructure
Working knowledge of data center fundamentals: power distribution, cooling, cabling, rack layout
Experience with network configuration: switches, VLANs, firewall rules, cable management
Familiarity with remote management (IPMI, iDRAC, BMC) and OS provisioning at scale
Hardware procurement experience: speccing systems, working with vendors, managing RMAs
You are hands-on first. You think in systems — from the power circuit to the container orchestrator. You can be the technical authority in the room whether you're talking to our ML engineers about GPU utilization or walking a utility IT director through their rack layout.

Nice To Haves

Experience tuning GPU environments for ML inference and training workloads
Familiarity with AI model serving, RAG pipelines, or LLM deployment
Experience with containerized runtimes (Docker, Kubernetes) for AI workloads
Experience with InfiniBand or other high-speed interconnects (RoCE, GPUDirect RDMA) for distributed AI workloads
Experience in utility IT, energy infrastructure, or other regulated industries
Experience operating on-prem or air-gapped environments
Hardware vendor relationships (NVIDIA, Supermicro, Dell, etc.)
Familiarity with cybersecurity expectations in critical infrastructure environments
Network certifications or deep switching/routing experience
