About The Position

Nebius is leading a new era in cloud computing to serve the global AI economy. We create the tools and resources our customers need to solve real-world challenges and transform industries, without massive infrastructure costs or the need to build large in-house AI/ML teams. Our employees work at the cutting edge of AI cloud infrastructure alongside some of the most experienced and innovative leaders and engineers in the field.

Where We Work

Headquartered in Amsterdam and listed on Nasdaq, Nebius has a global footprint with R&D hubs across Europe, North America, and Israel. The team of over 800 employees includes more than 400 highly skilled engineers with deep expertise across hardware and software engineering, as well as an in-house AI R&D team.

The Role

The Data Center Site Manager owns end-to-end reliability, safety, capacity, and performance for one of our flagship U.S. sites. You’ll lead a high‑performing, multi‑disciplinary operations team and partner tightly with Design, Build, Network, Security, Capacity Planning, and the DC orgs to deliver world‑class availability and cost efficiency.

Requirements

  • Associate’s degree or trade certification in Electrical/Mechanical/Industrial Engineering (or equivalent experience).
  • 10+ years in electrical/mechanical/HVAC/controls within industrial/commercial settings, with 5+ years specifically in data center or mission‑critical facilities.
  • Team leadership experience in 24/7 sites (managing leads/techs, vendors, and on‑call operations).
  • Deep, hands-on knowledge of UPS/generators/switchgear, chillers/CRAC/CRAH, fire detection/suppression, BMS/EPMS/DCIM, and structured cabling (copper & fiber).
  • Proven strength in incident management, RCA/Corrective Actions, change management, and vendor/contract oversight.
  • Data‑driven mindset with the ability to forecast resources and make analytics‑backed decisions (Excel; SQL/scripting a plus).
  • Excellent written/verbal communication with comfort presenting to executives and guiding field teams during live events.
  • Ability to travel up to ~30% and support after‑hours escalations when needed.

Nice To Haves

  • Bachelor’s degree in Electrical/Mechanical/Industrial Engineering, Engineering Management, or Reliability Engineering.
  • Hyperscale/colo experience with reliability‑centered maintenance, predictive analytics, and Lean/Six Sigma practices.
  • Familiarity with Linux fundamentals, network equipment installation/troubleshooting, and fiber optics testing.
  • Experience with Jira, Confluence, ServiceNow (or similar); strong SOP/MOP/EOP authorship.
  • Certifications such as CDCP, DCM, PMP, OSHA‑30, ITIL, or Uptime‑aligned credentials.

Responsibilities

  • Own the site 24/7: deliver continuous availability across power, cooling, structured cabling, network, security, and DCIM—meeting or beating global SLAs.
  • Build and lead the team: hire, mentor, and develop managers/technicians; run staffing models, shift coverage, and on‑call rotations that scale.
  • Be the incident commander: lead major events end‑to-end—triage, communications, executive briefings, RCA, and durable corrective actions.
  • Drive reliability engineering: implement RCM, predictive maintenance, QA/QC, 5S, and Lean/continuous improvement to cut MTTR and raise MTBF.
  • Deliver capacity on time: plan and execute expansions/retrofits; commission MEP systems with Design/Construction; achieve flawless change control (MOP/SOP/EOP).
  • Scale tooling & automation: mature DCIM/BMS/EPMS, monitoring/alerting, work management (Jira/ServiceNow), knowledge base (Confluence), and light scripting/SQL for telemetry and workflow automation.
  • Run a metrics‑first operation: publish dashboards and KPIs (availability, PUE, MTBF/MTTR, work compliance, safety) and use them to drive decisions.
  • Partner across functions: work with Cloud/Compute, Network, Security, and Capacity Planning to optimize performance, cost, and resiliency across the fleet.
  • Manage vendors & colos: own contracts, SLAs, and execution for rack deliveries, PDUs, fiber/copper, and lifecycle PMs; validate colo topology and compliance.
  • Raise the safety bar: enforce a zero‑injury EHS culture; conduct drills/audits for life safety, physical security, and data protection.
  • Forecast and budget: build data‑backed plans for power, spares, headcount, and projects; track OpEx/CapEx with rigor.

Benefits

  • Health insurance: 100% company-paid medical, dental, and vision coverage for employees and families.
  • 401(k) plan: up to 4% company match with immediate vesting.
  • Parental leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.
  • Remote work reimbursement: up to $85/month for mobile and internet.
  • Disability & life insurance: company-paid short-term and long-term disability insurance, plus life insurance coverage.
  • Competitive salary and comprehensive benefits package.
  • Opportunities for professional growth within Nebius.
  • Flexible working arrangements.
  • A dynamic and collaborative work environment that values initiative and innovation.

What This Job Offers

  • Job Type: Full-time
  • Career Level: Manager
  • Education Level: Associate degree
  • Number of Employees: 1,001-5,000 employees
