Data Center Engineer

TEKsystems
Santa Clara, CA
Onsite

About The Position

NVIDIA is seeking a highly experienced Data Center Operations Engineering Lead to serve as the on‑site operational owner for a critical data center location. This role is the operational backbone of the site, responsible for ensuring infrastructure reliability, uptime, compliance, and readiness to support production workloads across NVIDIA’s rapidly growing global data center footprint.

This is a hands‑on, high‑impact role for a senior engineer who thrives in mission‑critical environments, owns issues end‑to‑end, and drives operational excellence through strong technical judgment, disciplined processes, and cross‑functional leadership. You will act as the primary on‑site authority and escalation point while partnering with centrally managed engineering, facilities, network, security, and capacity planning teams. Tracking and reporting on areas for continuous improvement is key to the data center’s ongoing progress.

Requirements

  • Data center
  • Data center operations
  • Data center maintenance
  • Hardware troubleshooting
  • Troubleshooting
  • Infrastructure
  • Cooling systems
  • Power
  • PDU
  • Data center management
  • Data center facilities
  • Rack and stack
  • Strong operational judgment, prioritization, and organizational skills.
  • Excellent written and verbal communication skills, including executive‑level incident communication.
  • Ability to operate independently on‑site while collaborating with distributed teams and off‑site managers.
  • Experience with ITIL frameworks, change management, vendor SLAs, and compliance standards.

Nice To Haves

  • Experience supporting high‑density or liquid‑cooled GPU, AI, or HPC environments.
  • Prior ownership or leadership of data center compliance audits.
  • Scripting or automation experience (Python, Bash, etc.) to improve operational efficiency.

Responsibilities

  • Own day‑to‑day operational health of the assigned data center site.
  • Serve as the primary on‑site escalation point for operational, infrastructure, and facilities issues.
  • Lead incident response, triage, escalation, and resolution to maintain high availability and uptime.
  • Coordinate with internal teams, vendors, colocation providers, and Facilities Operations Centers (FOC) during incidents and maintenance events.
  • Ensure infrastructure readiness for new site turn‑ups, expansions, and post go‑live stabilization.
  • Inherit newly built lab or data center environments after buildout and transition them to steady‑state operations.
  • Govern infrastructure changes including installs, upgrades, retrofits, and decommissions with appropriate change management and rollback planning.
  • Maintain deep operational knowledge of critical systems: power distribution, cooling (air and liquid), networking, space, and rack density.
  • Manage and track preventative maintenance schedules for power, cooling, network, and compute infrastructure.
  • Monitor and manage site capacity (power, cooling, space, racks) and identify constraints and risks.
  • Maintain accurate asset inventories and track lifecycle from deployment through decommissioning using DCIM tools.
  • Develop, document, and continuously improve SOPs, runbooks, escalation workflows, and site readiness checklists.
  • Lead ITIL‑aligned change management and operational governance processes.
  • Track and report site‑level operational metrics; analyze trends to drive reliability and service improvements.
  • Identify opportunities to automate operational tasks and improve tooling and visibility.
  • Act as the local liaison between facilities, engineering, networking, security, capacity planning, and compliance teams.
  • Ensure physical and logical access controls are enforced and compliant.
  • Maintain audit readiness and support compliance efforts (e.g., SOC 2, ISO 27001, safety and regulatory certifications).
  • Manage relationships with vendors, service providers, and colocation partners, including SLAs and contracts.

Benefits

  • Medical, dental & vision
  • Critical Illness, Accident, and Hospital
  • 401(k) Retirement Plan – Pre-tax and Roth post-tax contributions available
  • Life Insurance (Voluntary Life & AD&D for the employee and dependents)
  • Short and long-term disability
  • Health Savings Account (HSA)
  • Transportation benefits
  • Employee Assistance Program
  • Time Off/Leave (PTO, Vacation or Sick Leave)

What This Job Offers

  • Job type: Full-time
  • Career level: Senior
  • Education level: None listed
  • Number of employees: 501-1,000
