ARM-posted 4 months ago
$241,100 - $326,100/Yr
Full-time • Manager
Hybrid • Austin, TX

Arm technology is becoming the platform of choice for compute and AI. The Arm System Engineering team's mission is to architect, design, and develop server and rack-level infrastructure for at-scale datacenter deployments. The team's capabilities span system hardware, software, system interconnect, system management, storage, data center infrastructure, and performance engineering. Its responsibilities include customer engagements, technology selection, system design, network architecture, performance, and datacenter deployment and operations. The Arm System Engineering team is developing industry-leading technology to deliver innovative, high-performing solutions to power the data centers of the future.

  • Lead and develop on-site operational teams (technicians and engineers) responsible for maintaining lab and data center infrastructure.
  • Act as the escalation point for incident response, troubleshooting, and issue resolution across HPC servers, networking, and liquid-cooled systems.
  • Oversee physical and logical infrastructure, including rack/stack, cabling, network design, power distribution, and advanced cooling systems (air and direct liquid cooling).
  • Ensure maximum system uptime by implementing monitoring, observability, and preventative maintenance practices.
  • Define and enforce operational standards, troubleshooting playbooks, and safety/compliance procedures for high-voltage and liquid-cooled environments.
  • Drive efficiency through automation, tooling, and process optimization across lab and data center operations.
  • Partner closely with engineering, facilities, IT, and leadership teams to align operations with business goals.
  • Oversee hardware lifecycle, including installation, inventory, and decommissioning.
  • 8+ years of data center or lab operations experience, with at least 3 years in a leadership or management role.
  • Proven success managing on-site teams in a high-uptime, mission-critical environment.
  • Hands-on experience with high-performance computing (HPC), AI clusters, or large-scale infrastructure deployments.
  • Strong background with break-fix, hardware installation, and repair of servers, networking, and power/cooling systems.
  • Familiarity with direct liquid cooling systems and other advanced cooling technologies.
  • Knowledge of incident management, problem management, and ITIL practices.
  • Excellent communication, leadership, and problem-solving skills.
  • Certifications such as CDCMP, ITIL, CCNA, or equivalent.
  • Experience with infrastructure monitoring & observability platforms.
  • Exposure to automation tools for deployment and operations.
  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent hands-on experience.
  • The chance to lead operations for cutting-edge AI and HPC systems.
  • A collaborative environment where your expertise makes an immediate impact.
  • Growth opportunities in one of the most advanced computing labs in the world.
  • Professional growth through involvement in complex projects and multidisciplinary collaboration.
  • Join a company committed to diversity and inclusion, where your work matters and drives global progress.