Senior Lead Site Reliability Engineer

JPMorganChase • Plano, TX

About The Position

As a Senior Lead Site Reliability Engineer at JPMorgan Chase within the Consumer and Community Banking team, you will set clear quality gates across requirements, design, secure coding, testing, releases, and post-production monitoring to ensure reliability, performance, security, and observability.

Requirements

10+ years supporting critical applications in large-scale environments, including experience leading and mentoring engineers/teams.
Strong SDLC and secure development practices, with experience implementing objective quality gates and release readiness standards.
Hands-on SRE experience, including SLIs/SLOs, error budgets, incident management, and post-incident reviews/root-cause analysis.
Experience designing actionable monitoring/logging and dashboards (e.g., Splunk, AppDynamics, or equivalent), including alert tuning.
Experience with CI/CD pipelines and automated testing (unit, integration, security), plus operational controls that reduce change risk.
Calm, accountable incident leadership under pressure, with strong communication and stakeholder management.
Comfortable collaborating with global teams and engaging during critical incidents outside standard business hours.

Nice To Haves

Proficiency in Python; experience with LangChain, LangGraph, or similar agentic frameworks.
Experience implementing LLMs using vector databases and Retrieval-Augmented Generation (RAG), as well as model tuning.
Strong SRE fundamentals: SLOs, SLIs, error budgets, blameless post-mortems, and capacity planning.
Hands-on experience with observability tooling (Datadog, Prometheus, OpenTelemetry, distributed tracing).
Experience leading operational readiness reviews and maintaining “Definition of Done” checklists (SLO monitoring, runbooks, rollback validation, resilience/failover testing, vulnerability remediation, audit/control artifacts).
Deep public cloud expertise (AWS or equivalent), including infrastructure automation (Terraform/Terraform Enterprise, CloudFormation), capacity planning, and resilience patterns for distributed systems.
Track record of improving reliability outcomes (higher availability, lower MTTR, lower change failure rate) through automation and observability.
Splunk Administrator certification (or equivalent).
Familiarity with containers and orchestration (Docker, Kubernetes) and modern production operations practices.

Responsibilities

Set clear quality gates across requirements, design, secure coding, testing, releases, and post-production monitoring to ensure reliability, performance, security, and observability.
Turn business goals into clear, testable requirements—and hold teams to an objective “Definition of Done” before release.
Define and manage SLIs/SLOs and error budgets, and ensure they’re reflected in roadmaps and delivery plans.
Lead operational readiness reviews, assess delivery risk, and drive fixes through root-cause analysis, corrective actions, and automation to prevent repeat issues.
Improve logging, monitoring, and alerting so dashboards are actionable and alerts are tuned to reduce noise and speed response.
Own CI/CD controls (security, reliability, testing, change management) and drive automation to reduce toil and increase release confidence.
Lead and participate in major incident response (including outside business hours when needed), run post-incident reviews, and drive improvements against KPIs like availability, MTTR, and change failure rate.
