Lead, NOC & Incident Management

Fluidstack, Austin, TX
$200,000 - $300,000

About The Position

Fluidstack is seeking a Lead, NOC & Incident Management to build and lead our cross-functional operations center (NOC) and incident management execution function. You’ll shape how Fluidstack detects, triages, and responds to operational events across our entire AI infrastructure portfolio, from datacenter facilities to network backbone to internal platform services. This role demands equal parts operational leadership and technical capability.

You’ll build the 24/7 monitoring and triage function, operationalize our incident management framework, and establish the operational culture that enables Fluidstack to meet stringent customer SLAs. Success means Fluidstack’s infrastructure teams stop spending time on operational toil (alert monitoring, carrier ticket management, incident bridge setup, shift coverage gaps) and instead focus on engineering and reliability work. You’re the person who ensures someone is always watching the glass, incidents are handled consistently, and post-incident learning actually happens.

Requirements

  • Proven NOC/Operations Center Leadership: 5+ years in network operations, infrastructure operations, or site reliability roles with significant experience running and building a NOC, operations center, or equivalent 24/7 monitoring function. You’ve built shift models, managed MSP relationships, and know how to turn a collection of monitors into a high-performing operational team. Ideally, you’ve done this at global scale.
  • Incident Management Expertise: Deep experience with structured incident response processes — severity classification, escalation matrices, incident bridges, post-incident reviews, and RCA workflows. You’ve been an Incident Manager or Incident Commander for major incidents and you know what good looks like under pressure. You understand that incident management is a skill that requires training, practice, and continuous refinement.
  • Technical Credibility Across Domains: You don’t need to be the deepest expert in network engineering, facilities, or systems — but you need enough technical breadth to triage alerts intelligently, ask the right questions during incidents, and earn the trust of the engineering teams you’ll partner with. Experience with datacenter infrastructure (network, power, cooling) and modern monitoring stacks (Prometheus/VictoriaMetrics, Grafana, AlertManager) is strongly preferred.
  • Process Builder, Not Just Process Follower: You’ve built operational processes from scratch in environments where they didn’t exist before. You know how to design runbooks that contract operators can execute reliably, escalation criteria that are crisp enough to be actionable, and training programs that get new team members productive quickly. You iterate based on real-world feedback, not theoretical perfection.
  • Cross-Team Influence: Exceptional at building partnerships across functional teams without direct authority. You’ve navigated the dynamics of getting engineering teams to write runbooks, participate in on-call rotations, and take post-incident actions seriously. You lead through credibility, follow-through, and consistent operational excellence rather than organizational hierarchy.
  • Customer SLA Mindset: You understand that operational metrics aren’t just internal targets — they’re the foundation of customer trust. You’ve worked in environments with stringent SLAs and you know how to build the operational discipline required to consistently meet them. You think about every process decision through the lens of “what happens when this matters at 2 AM?”

Nice To Haves

  • Hyperscale or Large-Scale Infrastructure Background: Experience operating NOC/operations centers at hyperscale companies (Meta, Google, Microsoft, AWS), large telcos, or major AI infrastructure providers. You’ve seen what mature operations looks like at scale and can adapt those patterns to a fast-growing startup.
  • Incident Management Tooling: Hands-on experience with incident management platforms (incident.io, PagerDuty, Opsgenie, ServiceNow) including configuration of escalation policies, on-call schedules, and alert routing. Bonus if you’ve led a platform migration or stood up a new instance from scratch.
  • MSP/Vendor Management: Experience selecting, onboarding, and managing managed service providers for NOC or operations functions. You’ve written SOWs, negotiated SLAs, and managed the transition from outsourced to internal operations.
  • Facilities & BMS Familiarity: Exposure to datacenter facilities operations — power distribution, cooling systems, CDUs, BMS/SCADA alerting. You don’t need to be a mechanical engineer, but understanding facilities alert triage is valuable since Facilities is the MVP domain for the NOC.
  • Carrier & ISP Operations: Experience managing carrier relationships, circuit troubleshooting, and vendor ticket workflows. Familiarity with carrier NOC processes, circuit ID management, and SLA enforcement.
  • Startup Experience: You’ve built something from scratch before — ideally in a high-growth infrastructure or cloud company. You’re comfortable with rapid context switching, evolving requirements, and the intensity of early-stage company building.

Responsibilities

  • NOC Build & Operations: Stand up the cross-functional operations center from scratch. Assist in selecting and onboarding an MSP partner for Tier 1 coverage. Build staffing models, handoff processes, KPIs, and quality standards. Own the single question: “is someone qualified watching every alert, 24/7?”
  • Incident Management Execution: Create, deploy and operationalize Fluidstack’s incident management framework. Manage the Incident Manager on-call rotation. Train engineers on incident roles. Run incident bridges during SEV0/SEV1 events. Ensure post-incident reviews happen on schedule and action items actually close. Partner with the Program Manager (process owner) to continuously improve the framework based on real-world execution.
  • Operational Readiness: Own the “are we ready?” question for every new domain onboarded to the NOC. Drive runbook quality assurance with functional teams. Plan and execute tabletop exercises. Coordinate with the Platform team on incident.io tooling workflows. Onboard new infrastructure domains (Facilities, Network, Systems) into NOC coverage on a phased schedule aligned with datacenter launches.
  • Cross-Functional Orchestration: Build tight operational partnerships with Network Ops, DC Ops, Systems/Platform, and Security teams. Define clear Tier 1 → Tier 2 escalation criteria for each domain. Ensure the NOC acts as a force multiplier for engineering teams by absorbing monitoring, triage, vendor ticket management, and incident coordination.
  • Vendor & Carrier Ticket Lifecycle: Establish processes for the NOC to manage the full lifecycle of carrier and vendor tickets — creation, tracking, SLA enforcement, escalation. Work with Network Ops and DC Ops to define ticket templates, escalation triggers, and vendor communication standards. Ensure no ticket falls through the cracks and every carrier/vendor interaction is documented.
  • Metrics & Continuous Improvement: Establish operational metrics (MTTA, MTTR, escalation rate, false positive rate, runbook coverage) and reporting cadence. Use data to identify patterns, reduce alert noise, improve runbook quality, and drive down incident response times. Produce monthly operational reports for leadership and customer-facing stakeholders.

Benefits

  • Competitive total compensation package (salary + equity).
  • Retirement or pension plan, in line with local norms.
  • Health, dental, and vision insurance.
  • Generous PTO policy, in line with local norms.