Reliability & Observability Analyst I

IREN • Dallas, TX

Onsite

About The Position

IREN is a leading next-generation data center business powering the future with 100% renewable energy. We build, own and operate our data centers and take pride in being at the forefront of sustainable solutions for the ever-evolving applications of high-performance computers. We believe that human progress is invaluable, but it should be done in the right way: responsibly, sustainably, and with a positive impact on the communities we operate in.

We are seeking an IOC Reliability & Observability Analyst I with a strong reliability, observability, and automation mindset to support our 24/7 HPC Data Center Operations. The role focuses on analyzing operational signals, improving incident quality, and supporting AIOps-enabled automation and tooling. It is designed for candidates early in their careers who want to grow into Site Reliability, Infrastructure Operations, or Platform Engineering paths.

This is an entry‑level (Level 1) IOC role focused on operational analysis, data quality, and reliability signal validation rather than system design or engineering ownership. You will support IOC, engineering, and operations teams by analyzing incidents, validating operational signals, and identifying opportunities to improve detection quality and operational reliability under established processes and guidance.

Requirements

1-3 years of experience in IOC, NOC, SRE‑adjacent operations, systems analysis, or technical support roles
Bachelor's degree in Computer Science, Data Science, Statistics, or equivalent hands-on experience
Exposure to 24/7 production environments supporting infrastructure, cloud, or data center operations
Foundational awareness of SRE concepts such as service health, MTTR/MTTD, and the incident lifecycle, with the ability to apply these concepts in operational analysis
Working knowledge of Linux-based systems, basic networking concepts, and infrastructure dependencies
Experience working with metrics, logs, and alerting systems across infrastructure or application environments
Familiarity with observability platforms (e.g., Splunk, Datadog, Prometheus-style metrics)
Ability to assess alert quality, identify noise, and recognize monitoring gaps
Awareness of AIOps concepts such as anomaly detection, event correlation, and alert noise reduction, primarily for the purpose of reviewing and validating automated insights
Experience validating automated insights and supporting alerting or observability automation
Ability to read automation artifacts (Python, Bash, or configuration-based workflows) and assist with minor updates under documented procedures and guidance
Ability to analyze incident trends and system behaviors with strong attention to data accuracy and signal integrity, and to identify recurring issues or improvement opportunities
Clear communication skills and comfort working cross-functionally with operations and engineering teams

Responsibilities

Analyze incident data, system behaviors, and operational signals across GPU clusters, networks, and facilities to identify risks and trends
Identify detection gaps, alert delays, false positives, and under-monitored systems, and document findings for review by IOC leadership or engineering teams
Validate ticketing and incident data for accuracy, completeness, and reporting integrity
Support continuous improvement of observability by evaluating metrics, logs, alerts, and dashboards
Assist in refining operational views focused on service health, reliability, and signal quality
Generate post-incident insights highlighting trends, risks, and improvement opportunities
Support AIOps-enabled capabilities by reviewing outputs from anomaly detection, alert correlation, and event clustering, and flagging accuracy or data-quality issues
Validate automated insights and escalate tuning or accuracy issues to IOC and engineering teams
Assist with testing automation related to alert routing, enrichment, and suppression, and submit recommended changes through established change and review processes
Produce and maintain SLA/KPI dashboards and reliability reports using established templates, definitions, and data sources
Provide data-driven insights and recommendations to inform preventive measures, workflow improvements, and monitoring enhancements
Contribute to runbook updates, operational documentation, and reliability initiatives in partnership with IOC and engineering teams
Develop foundational SRE skills in preparation for expanded operational responsibility
This role operates under defined IOC processes and supervision, with increasing responsibility as skills and experience develop

Benefits

Overtime compensation for non-exempt workers for hours worked over 40 per week
100% company-paid health insurance premiums (medical, dental, and vision) for employees; 75% company-paid coverage for dependents
Company-paid short-term and long-term disability insurance
Voluntary life, critical illness, and accident coverage available
Health Savings Account (HSA), available when combined with the High-Deductible Health Plan
Employee Assistance Program and wellness resources
401(k) retirement plan with company match
Paid professional development and access to financial planning and legal services
Paid Time Off (PTO) and paid holidays
Professional development to support certifications, continuing education, or role-related training
Company events and team-building activities