Senior Site Reliability Engineer - Incident Management/Resiliency (Hybrid)

Enova International | Chicago, IL
$85,000 - $125,000 | Hybrid

About The Position

Resilience Engineering is a subset of the Site Reliability Engineering team that strives to foster a culture of continuous improvement through incident analysis, process evolution, and problem-solving. We work closely with teams across Tech, Product, and Operations through our Production Incident process to uncover system weaknesses, learn from failures, and make our technology more reliable. In this role, you’ll play a key part in enhancing the resiliency of our systems. Your work will focus on our incident response, reporting, and analysis processes, enabling the organization to better prepare for and respond to complex system failures. You’ll drive cross-department efforts to deliver reliable, resilient, and observable solutions at Enova.

Requirements

  • 5+ years of experience in a technology or analyst role (e.g., Software Engineering, Systems, Operations, SRE, or Product).
  • A strong interest in how complex distributed systems operate—and how to make them more reliable.
  • Analytical and problem-solving skills with a systems-thinking mindset.
  • Strong communication skills, both verbal and written, with the ability to tailor messaging to technical and non-technical audiences.
  • Experience querying and analyzing data (e.g., SQL, PostgreSQL, Kafka).
  • Comfort with ambiguity, and the ability to turn vague problems into actionable insights.
  • Demonstrated maturity, sound judgment, and organizational awareness.
  • Ability to coordinate the resolution of major incidents and reviews following the Enova Incident Management Process.
  • Ability to seamlessly shift between high-urgency incident response and structured project work, with strong organizational skills and the capacity to manage projects independently.

Nice To Haves

  • Experience leading resolution of major system outages or production incidents.
  • Experience driving large-scale technical or process changes.

Responsibilities

  • Lead production incidents as part of our PI PIC (or Incident Commander) rotation after completing training, ensuring clear communication and resolution.
  • Capture and maintain detailed documentation of incidents, contributing factors, and learnings in formal incident reports.
  • Deliver documentation that is clear, comprehensive, and accessible to different types of audiences in a timely manner within the established SLAs.
  • Facilitate and document blameless post-incident reviews that promote learning and continuous improvement.
  • Collect and analyze incident data to identify systemic issues, risks, and trends. Lead incident data reviews in front of a wide range of stakeholders, including technical and business leadership.
  • Work on improvements to how we collect, analyze, and learn from system failures.
  • Champion a culture of operational excellence and resilience across the organization.
  • Collaborate with engineering, product, and operations teams to address vulnerabilities and build more resilient systems.
  • Design and run failure simulations (e.g., mock incidents, disaster recovery exercises) to proactively identify weak points.

Benefits

  • Health, dental, and vision insurance including mental health benefits
  • 401(k) matching plus a Roth option (U.S.-based employees only)
  • PTO and paid holidays
  • Sabbatical program (for eligible roles)
  • Summer hours (for eligible roles)
  • Paid parental leave
  • DEI groups (B.L.A.C.K. @ Enova, HOLA @ Enova, Women @ Enova, Pride @ Enova, South Asians @ Enova, APEX @ Enova, and Parents @ Enova)
  • Employee recognition and rewards program
  • Charitable matching and a paid volunteer day… plus so much more!