The Site Reliability Engineering Lead is a senior, hands-on technical leader within the Wholesale Production Support Operations organization. This role is accountable for elevating the reliability, resiliency, and operational excellence of critical enterprise platforms across hybrid cloud and on-premises environments. Acting as both a hands-on SRE expert and a cross-domain influencer, the SRE Lead drives systemic improvements in observability, automation, AIOps adoption, fault tolerance, and incident management. The role partners closely with Application Development, Infrastructure, Production Support, Platform Delivery, Architecture, Cybersecurity, Risk, and Business technology teams to uplift operational practices and deliver stable, predictable, and scalable services. This position also plays a pivotal role in building and maturing the SRE Center for Enablement (C4E) by contributing standards, repeatable patterns, runbooks, playbooks, and coaching that amplify reliability practices across the enterprise. The SRE Lead delivers measurable impact through deep expertise in distributed systems, modern operational tooling, cloud-native reliability patterns, and enterprise-scale incident and problem management.
Key Responsibilities include:
Incident & Problem Management Leadership: leading major and high-severity incident response, diagnosing technical root causes, driving multi-team technical resolution and problem management, and establishing standardized incident playbooks, escalation paths, and communication frameworks.
Reliability Engineering & Automation: architecting and delivering automation solutions; implementing intelligent alerting, anomaly detection, and event correlation using AI and AIOps tools; and guiding and enforcing SLO/SLI adoption.
Observability & Operational Excellence: enhancing telemetry coverage across logs, metrics, traces, and events using platforms such as Dynatrace and Splunk; defining and standardizing enterprise observability practices, dashboards, and KPIs; and ensuring operational readiness through resiliency testing, chaos engineering, and failure-mode validation.
Cross-Functional Leadership & Influence: partnering with Delivery, Architecture, Security, and Risk teams to embed reliability and resilience into design and execution; acting as a change agent; and leading workshops, maturity assessments, and enablement sessions.
Standardization & Documentation: developing, maintaining, and enforcing runbooks, response playbooks, and automated recovery patterns; contributing to enterprise SRE frameworks, templates, and maturity models; and promoting consistent adoption of best practices.
Mentorship & Technical Development: coaching and mentoring Associate, Professional, and Senior SREs, and providing thought leadership in SRE methodologies, cloud-native operational patterns, and automated reliability engineering.
Job Type: Full-time
Career Level: Senior
Number of Employees: 5,001-10,000