ARM - posted 3 months ago
$241,100 - $326,100/Yr
Full-time
Austin, TX
5,001-10,000 employees
Professional, Scientific, and Technical Services

Arm technology is becoming the platform of choice for compute and AI. The Arm System Engineering team's mission is to architect, design, and develop server and rack-level infrastructure for at-scale datacenter deployments. The team's capabilities span system hardware, software, system interconnect, system management, storage, data center infrastructure, and performance engineering. Its responsibilities include customer engagements, technology selection, system design, network architecture, performance, and datacenter deployment and operations. The Arm System Engineering team is developing industry-leading technology to deliver innovative and high-performing solutions to power the data centers of the future.

We are seeking a hands-on Data Center & Lab Operations Manager to lead the day-to-day operations and break-fix response across two high-performance computing (HPC) labs and data centers in Austin, TX. This role is responsible for managing operational teams, ensuring uptime, and maintaining critical infrastructure that supports advanced AI and HPC workloads, including liquid-cooled and high-power computing systems.

This is not a remote position; we need someone on-site in Austin who thrives in a mission-critical environment and brings deep experience managing operational teams, solving complex technical issues, and ensuring safe, reliable, and efficient data center operations.

  • Lead and develop on-site operational teams (technicians and engineers) responsible for maintaining lab and data center infrastructure.
  • Act as the escalation point for all incident response, troubleshooting, and resolution of HPC servers, networking, and liquid-cooled systems.
  • Oversee physical and logical infrastructure, including rack/stack, cabling, network design, power distribution, and advanced cooling systems (air and direct liquid cooling).
  • Ensure maximum system uptime by implementing monitoring, observability, and preventative maintenance practices.
  • Define and enforce operational standards, troubleshooting playbooks, and safety/compliance procedures for high-voltage and liquid-cooled environments.
  • Drive efficiency through automation, tooling, and process optimization across lab and data center operations.
  • Partner closely with engineering, facilities, IT, and leadership teams to align operations with business goals.
  • Oversee hardware lifecycle, including installation, inventory, and decommissioning.
  • 8+ years of data center or lab operations experience, including at least 3 years in a leadership or management role.
  • Proven success managing on-site teams in a high-uptime, mission-critical environment.
  • Hands-on experience with high-performance computing (HPC), AI clusters, or large-scale infrastructure deployments.
  • Strong background with break-fix, hardware installation, and repair of servers, networking, and power/cooling systems.
  • Familiarity with direct liquid cooling systems and other advanced cooling technologies.
  • Knowledge of incident management, problem management, and ITIL practices.
  • Excellent communication, leadership, and problem-solving skills.
  • Certifications such as CDCMP, ITIL, CCNA, or equivalent.
  • Experience with infrastructure monitoring & observability platforms.
  • Exposure to automation tools for deployment and operations.
  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent hands-on experience.
  • The chance to lead operations for cutting-edge AI and HPC systems.
  • A collaborative environment where your expertise makes an immediate impact.
  • Growth opportunities in one of the most advanced computing labs in the world.
  • Professional growth through involvement in complex projects and multidisciplinary collaboration.