Head of Infrastructure, Stealth Edge AI Co

Montauk Capital
New York, NY
Hybrid

About The Position

We’re building the automation, orchestration, and monitoring layer that unifies disparate metro edge GPU nodes into a single software-managed compute platform. You’ll own the definition, design, and execution of our hardware and infrastructure buildout as we scale: edge data center requirements, GPU selection, supply chain, technical implementation, operational maintenance, and deployment. Starting from the foundational groundwork, you’ll turn our roadmap into production-scale compute for AI inference. You’ll ensure our GPU clusters are highly available and deliver on customer requirements, and you’ll be the hands-on expert for the hardware side of the business. Most importantly, you’ll turn high-level plans into real technical execution, and you’ll play a key role in supply chain decisions about the infrastructure we deploy, scale, and support.

Requirements

  • Strong infrastructure engineering experience and systems-level technical judgment
  • Experience deploying or managing compute infrastructure in real-world environments
  • Experience with data center, hardware, or GPU-based systems implementation
  • Experience owning GPU provisioning, hardware selection, and systems configuration
  • GPU scheduling and orchestration specifics: GPU type awareness, memory management, topology considerations, placement strategies for multi-GPU jobs, and fragmentation minimization
  • Bare-metal provisioning lifecycle: IPMI/Redfish, BMC-based remote management, PXE boot, and automated OS deployment workflows
  • On-board storage
  • Observability stack: distributed configuration and troubleshooting, plus monitoring, alerting, and tracing
  • Deployment planning, hardware configuration, and operational troubleshooting
  • Linux systems depth: RHEL/Ubuntu, low-level troubleshooting, shell scripting
  • Security and operational best practices for bare metal
  • Deployment tooling at production scale
  • Networking fundamentals for inference workloads and out-of-band (OOB) management
  • Startup / 0→1 DNA: You ship fast and communicate clearly.
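To give a flavor of the GPU scheduling and fragmentation-minimization work listed above, here is a minimal, illustrative sketch of fragmentation-aware placement for a multi-GPU job; the node names, data shapes, and best-fit heuristic are our own assumptions for illustration, not the company's actual stack:

```python
# Toy fragmentation-aware GPU placement (illustrative sketch only).
# Best-fit heuristic: place the job on the node with the FEWEST free
# GPUs that still fits it, preserving large free blocks for big jobs.

def place_job(nodes, gpus_needed):
    """nodes: dict mapping node name -> free GPU count.
    Returns the chosen node name, or None if no node fits the job."""
    candidates = [(free, name) for name, free in nodes.items()
                  if free >= gpus_needed]
    if not candidates:
        return None  # job cannot be placed on any single node
    free, name = min(candidates)  # tightest fit wins
    nodes[name] -= gpus_needed    # reserve the GPUs
    return name

nodes = {"edge-a": 8, "edge-b": 4, "edge-c": 2}
print(place_job(nodes, 4))  # best fit picks edge-b, keeping edge-a whole
print(place_job(nodes, 8))  # edge-a's 8-GPU block is still intact
```

A naive first-fit scheduler would put the 4-GPU job on edge-a and then fail to place the 8-GPU job anywhere, which is exactly the fragmentation this heuristic avoids; a production scheduler would additionally weigh GPU type, memory, and NVLink/PCIe topology.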

Responsibilities

  • Own GPU infrastructure design and implementation details from planning through deployment
  • Own hardware selection, configuration, and deployment across early compute infrastructure
  • Help turn early technical groundwork into a functioning deployed system
  • Own the GPU roadmap we use to attract customers and build partnerships
  • Deploy, operate, and tune GPU clusters for both bare metal and our internal software stack
  • Own resilient networking implementation from each site to the cluster, including a robust OOB network for constant monitoring and management
  • Manage deployments at production scale
  • Interface with site ops on power, cooling, and connectivity
  • Build the automation and monitoring stack for distributed edge nodes
  • Own the supply chain for all infrastructure gear
  • Manage third-party hardware vendors on provisioning, maintenance, and break-fix support

Benefits

  • Competitive compensation + equity: True ownership over what you build
© 2026 Teal Labs, Inc