Site Reliability Engineer (Java-focused), Senior or Lead

ERCOT•Taylor, TX

Hybrid

About The Position

At ERCOT, our diverse and dynamic work environment provides a platform on which employees can work together to build the future of the Texas power grid and wholesale market using the latest technologies and resources. We encourage you to join our talented, dedicated workforce to develop world-class solutions for today’s and tomorrow’s energy challenges while learning new skills and growing your career.

ERCOT is committed to fostering inclusion at all levels of our company. It is the cornerstone of our corporate values of accountability, leadership, innovation, trust, and expertise. We know that individuals with a wide variety of talents, ideas, and experiences propel the innovation that drives our success. An inclusive and diverse workforce strengthens us and allows for a collaborative environment to solve the challenges that face our industry today and in the future.

Job Summary

ERCOT is seeking a Senior or Lead Site Reliability Engineer (SRE) with strong Java application expertise to ensure the availability, performance, and reliability of mission-critical systems. This role follows ERCOT-specific SRE processes and principles, which include managing site failover between two datacenters and, in the future, treating Azure as an extended datacenter. You will work deeply with Java codebases while owning production health and operational excellence.

Requirements

5+ years (Senior) or 10+ years (Lead) in SRE, DevOps, or Production Engineering
Strong Java experience (Spring-based systems)
Experience with distributed, high-availability systems
Expertise in observability tools (metrics, logs, traces)
CI/CD experience (Git, Maven, Jenkins)
Strong cross-layer debugging skills
Degree in Computer Science or a related field required
Strong hands-on experience with observability and APM platforms such as Splunk, Dynatrace, and Datadog
Expertise in using Metrics, Logs, Traces, and Profiling (MLTP) to troubleshoot complex production incidents
Experience with the Grafana LGTM stack for observability (Loki for logs, Grafana for dashboards and visualization, Tempo for traces, and Mimir for metrics)
Experience correlating application performance data with system behavior to identify root causes and prevent recurrence

Nice To Haves

Python
Kubernetes or OpenShift
Microsoft Azure
Kafka or ActiveMQ
Infrastructure automation (Terraform, Azure Resource Manager, Ansible, Liquibase)
Chaos or load testing experience