About The Position

At ERCOT, our diverse and dynamic work environment provides a platform on which employees can work together to build the future of the Texas power grid and wholesale market utilizing the latest technologies and resources. We encourage you to join our talented, dedicated workforce to develop world-class solutions for today’s and tomorrow’s energy challenges while learning new skills and growing your career.

ERCOT is committed to fostering inclusion at all levels of our company. It is the cornerstone of our corporate values of accountability, leadership, innovation, trust, and expertise. We know that individuals with a wide variety of talents, ideas, and experiences propel the innovation that drives our success. An inclusive and diverse workforce strengthens us and allows for a collaborative environment to solve the challenges that face our industry today and in the future.

Job Summary

ERCOT is seeking a Senior or Lead Site Reliability Engineer (SRE) with strong Java application expertise to ensure the availability, performance, and reliability of mission-critical systems. This role will follow ERCOT-specific SRE processes and principles, which include managing site failover between two datacenters and, in the future, treating Azure as an extended datacenter. You will work deeply with Java codebases while owning production health and operational excellence.

Requirements

  • 5+ years (Senior) or 10+ years (Lead) in SRE, DevOps, or Production Engineering
  • Strong Java experience (Spring-based systems)
  • Experience with distributed, high-availability systems
  • Expertise in observability tools (metrics, logs, traces)
  • CI/CD experience (Git, Maven, Jenkins)
  • Strong cross-layer debugging skills
  • CS or related degree required
  • Strong hands-on experience with observability and APM platforms such as Splunk, Dynatrace, DataDog
  • Expertise in using Metrics, Logs, Traces, and Profiling (MLTP) to troubleshoot complex production incidents
  • Experience with the Grafana LGTM stack for observability (Loki for logs, Grafana for dashboards and visualization, Tempo for traces, and Mimir for metrics)
  • Experience correlating application performance data with system behavior to identify root causes and prevent recurrence

Nice To Haves

  • Python
  • Kubernetes or OpenShift
  • Microsoft Azure
  • Kafka or ActiveMQ
  • Infrastructure automation (Terraform, Azure Resource Manager, Ansible, Liquibase)
  • Chaos or load testing experience

Responsibilities

  • Own reliability, availability, latency, and scalability of Java-based systems
  • Define and track SLIs, SLOs, and error budgets
  • Design and maintain monitoring, alerting, logging, and dashboards
  • Lead incident response and conduct blameless postmortems
  • Reduce operational toil through automation and tooling
  • Review system designs for reliability and failure modes
  • (Lead level) Establish reliability standards and mentor engineers
  • Debug and improve Java applications (Spring Boot preferred)
  • Perform JVM tuning and performance analysis
  • Diagnose failures across databases, messaging, and APIs
  • Partner with development teams to improve resilience and recovery
  • Participate in an on-call rotation for supported services
  • Focus on engineering solutions rather than repetitive manual work
  • Emphasize post-incident learning and automation
  • Track and actively reduce toil

Benefits

  • Health
  • Dental
  • Vision
  • Life insurance
  • Long/short-term disability insurance
  • Long-term care insurance
  • Section 125 Flexible Spending Account
  • Retirement Savings Plan