IOC Systems Specialist

IREN
Fort Worth, TX
Onsite

About The Position

IREN is a leading AI Cloud Service Provider, delivering large-scale GPU clusters for AI training and inference. IREN’s vertically integrated platform is underpinned by its expansive portfolio of grid-connected land and data centers in renewable-rich regions across the U.S. and Canada. With 100% renewable energy, we build, own and operate our data centers and take pride in being at the forefront of sustainable solutions for the ever-evolving applications of high-performance compute. We believe that human progress is invaluable, but it should be pursued in the right way: responsibly, sustainably, and with a positive impact on the communities we operate in.

Requirements

  • 2–5 years of experience operating or supporting HPC clusters in a production or IOC/NOC environment, including incident triage, change execution, and coordination with engineering teams.
  • Working knowledge of Kubernetes (bare-metal a plus) sufficient to triage workloads, interpret cluster state, and execute documented operational procedures.
  • Operational experience with the Slurm workload manager — job triage, queue and node health checks, and escalation of scheduler issues to engineering.
  • Familiarity with HPC monitoring and observability tooling for alerting, log triage, runbook execution, and proactive issue detection.
  • Demonstrated track record of incident response, root cause analysis, and contributing to operational improvements in complex production systems.
  • Working understanding of cloud platforms (AWS, Azure, or GCP) and how they integrate with on-premises HPC environments.
  • Working knowledge of the network and storage components common to HPC — InfiniBand/Ethernet fabrics, scale-out storage (e.g., Weka, VAST), and high-throughput interconnects — sufficient to triage issues and engage the right escalation path.

Nice To Haves

  • Post-secondary education in Computer Science, Engineering, or a related technical field is an asset; equivalent hands-on operational experience is equally valued.
  • Relevant certifications are advantageous — e.g., CKA/CKAD, Linux+/RHCSA, ITIL Foundation, CompTIA Server+, or HPC/GPU vendor certifications.

Responsibilities

  • Provide Tier 2 operational support for HPC cloud environments, ensuring high availability, performance stability, and adherence to SLAs.
  • Monitor, troubleshoot, and resolve complex incidents across HPC infrastructure, including Kubernetes, Slurm, cluster management systems, and associated cloud services.
  • Act as the escalation point from Tier 1, performing root cause analysis (RCA) and coordinating with Tier 3/engineering teams for defect resolution and permanent fixes.
  • Operate and maintain monitoring, alerting, and observability tooling to proactively detect issues and minimise service disruption.
  • Execute operational changes, patches, upgrades, and maintenance activities in line with change management processes.
  • Maintain and improve operational documentation, including runbooks, playbooks, incident reports, and knowledge base articles to support efficient IOC operations.
  • Contribute to continuous service improvement by identifying recurring issues, operational risks, and automation opportunities within the HPC environment.
  • Support tooling integration and operational readiness for new HPC capabilities prior to handover into production support.
  • Provide technical guidance and mentoring to Tier 1 and DC Techs, enhancing IOC capability and knowledge depth.
  • Participate in on-call rotations and major incident response activities.

Benefits

  • Base salary
  • Annual performance incentives
  • Equity programs
  • 100% company-paid health insurance premiums (medical, dental, and vision) for employees
  • 75% company-paid coverage for dependents
  • Company-paid short-term and long-term disability insurance
  • Voluntary life, critical illness, and accident coverage available
  • Health Savings Accounts (HSA) – when combined with the High Deductible Health Plan
  • Employee Assistance Program and wellness resources
  • 401(k) retirement plan with company match
  • Access to financial planning and legal services
  • Paid Time Off (PTO)
  • Paid holidays
  • Internal skills training and advancement pathways
  • Professional development to support certifications, continuing education, or role-related training
  • Company events and team-building activities