Senior Site Reliability Engineer

First Horizon Bank • Plano, TX

99d • Onsite

About The Position

We are seeking a Senior Site Reliability Engineer who will be the guardian of our Azure infrastructure reliability. This role focuses on building comprehensive observability platforms, implementing intelligent monitoring systems, and proactively identifying issues before they impact production. You will create the tools and automation that predict, detect, and prevent problems rather than simply reacting to them. Your primary mission is ensuring our Azure infrastructure and applications never surprise us with failures.

The ideal candidate has deep expertise in Azure Monitor, Application Insights, Log Analytics, and KQL, combined with strong scripting skills in Python or PowerShell. You should have 5-7+ years of experience implementing observability platforms and a proven track record of preventing incidents through proactive monitoring and automation. You'll work with technologies like Prometheus, Grafana, OpenTelemetry, and Azure services (AKS, App Services, Azure SQL, Cosmos DB) while building self-healing automation and predictive analytics tools that keep our systems healthy.

Requirements

Deep expertise in Azure Monitor, Application Insights, Log Analytics, and KQL
Strong scripting skills in Python or PowerShell
5-7+ years of experience implementing observability platforms
Proven track record of preventing incidents through proactive monitoring and automation

Responsibilities

Design and implement a comprehensive observability stack across all Azure resources and applications
Build intelligent alerting systems with anomaly detection and predictive capabilities to prevent incidents
Create self-healing automation and auto-remediation tools that resolve issues without human intervention
Develop internal monitoring platforms, dashboards, and CLI tools for engineering teams
Write KQL queries and analyze metrics/logs to identify optimization opportunities and predict failures
Implement continuous resource monitoring for Azure quotas, costs, security posture, and service health
Build capacity forecasting and trend analysis tools to prevent resource exhaustion
Reduce alert noise while improving coverage and actionability of monitoring systems
Participate in a light on-call rotation (prevention-focused approach reduces reactive incidents)

Benefits

Medical with wellness incentives, dental, and vision
HSA with company match
Maternity and parental leave
Tuition reimbursement
Mentor program
401(k) with 6% match
More -- FirstHorizon.com/First-Horizon-National-Corporation/Careers/Our-Benefits
