Engineering Manager, Cloud Platform

Verdigris•Palo Alto, CA

Hybrid

About The Position

GPU racks pull 120-140 kW each, heading toward 600 kW-1 MW per rack by 2027. Design margins in data centers have compressed from 30% to 10-15%. Standard BMS systems poll at 1-second intervals but GPU workloads ramp in 8 milliseconds. The gap between what operators can see and what they need to see is widening fast.

Verdigris closes that gap. We build the electrical intelligence layer for AI infrastructure. Continuous 8 kHz measurement that detects hidden degradation and validates safe operating headroom. Our platform sits between monitoring infrastructure (Schneider, Eaton, Vertiv) and autonomous controls. We are the validation layer that makes both trustworthy. We are building toward a world where data center operators can safely unlock stranded capacity, prevent failures before they cascade, and ultimately enable autonomous power orchestration for AI workloads.

About 50 people. Series B. Real customers, real revenue, real hardware deployed in colocation facilities running AI workloads. The cloud platform already processes billions of 8 kHz waveform readings from deployed sensors and turns them into validated operating limits that operators use daily. Today that means reliability and early warning. Tomorrow it means capacity optimization and machine-facing orchestration APIs that GPU schedulers consume directly.

We are hiring an Engineering Manager to lead the cloud platform team, the system that makes all three product pillars (Observability, Intelligence, Orchestration) work. You would manage a team of 3-5 engineers, reporting to Jon (co-founder/CTO), with a mandate to grow the team and raise the bar.

Here is the situation. The platform works. Customers depend on it. The 8 kHz ingestion pipeline is real and running in production. But the architecture has grown faster than the team's ability to maintain it cleanly, and the org structure has not kept pace with what we need to build next.

AI infrastructure spend is projected at $250-650B in capex, and demand for validated electrical intelligence is accelerating with it. We need someone who can take ownership of the platform, organize the team around clear ownership, and raise the quality of how we build and ship, while also building toward the orchestration layer that does not exist yet.

This is a player-coach role. You will manage people, set direction, and run the engineering operating cadence. You will also read code, debug production issues, and make architectural calls. If you have not been in a codebase recently, this is not the right fit.

Requirements

You have real technical depth in cloud infrastructure, data systems, or ML platforms. You can review architecture, debug production, and make tradeoffs, not just delegate them.
You have inherited or built a small team before and made it better. Not by replacing everyone, but by setting clear expectations, building ownership, and coaching people up, and by making hard calls when coaching was not enough.
You can operate without a clean roadmap. Cross-functional dependencies, incomplete requirements, competing priorities. You turn that into a plan with owners and timelines.
You care about production quality. Observability, incident response, release discipline. You build the habits, not just the systems.
You are genuinely interested in what happens when AI meets physical infrastructure. Our customers run mission-critical facilities where electrical reliability directly determines whether AI workloads stay online. The validation layer we are building does not exist anywhere else. This is new territory.

Responsibilities

Audit the platform: reliability, scalability, observability, tech debt. Form your own view, not just ours.
Organize team ownership across the three-pillar stack: Observability (ingestion, 8 kHz data pipeline), Intelligence (ML signal processing, validated operating limits), and the APIs and dashboards that deliver them.
Stand up an engineering operating cadence: roadmap reviews, incident reviews, delivery planning, architecture reviews.
Get your hands dirty on the hardest reliability and performance problems. Ship fixes, not just plans.
Identify hiring gaps and start filling them. Raise the bar on who we bring in.