Lead Infrastructure Engineer (SRE)

Wells Fargo•Concord, CA

1d•$119,000 - $224,000•Hybrid

About The Position

Wells Fargo is seeking a Lead Infrastructure Engineer (SRE) to join Technology within COO Tech Banking Operations. This is a pivotal role for an experienced engineer who thrives on solving complex problems through innovation and driving change at scale in a large, diverse enterprise environment. As a Lead SRE, you will be part of a high-impact team responsible for advancing and embedding SRE practices across multiple applications and critical customer journeys within the Banking Operations platform. You will play a central role in transforming how reliability, scalability, and observability are engineered and sustained, helping to shape a modern, resilient, and data-driven technology ecosystem.

This team is at the forefront of driving technology transformation across the enterprise by adopting SRE-aligned capabilities, launching new tooling, automating complex operational challenges, and integrating with modern platforms and pipelines. Leveraging your background in software and systems engineering, you will ensure that onboarded applications are highly available, resilient, and fully instrumented with end-to-end observability.

In this role, you will lead the adoption and evolution of observability practices, including metrics, logging, tracing, and telemetry, while promoting operational excellence through code, automation, and continuous improvement. You will introduce and scale data-driven insights, enabling smarter decision-making and proactive issue resolution across the ecosystem. You will also partner closely with application and platform engineering teams to ensure services are reliable, measurable, and continuously improving. Your work will include building and enhancing CI/CD integration, validating system reliability through rigorous testing, and driving the modernization of operational practices across the organization.

Requirements

5+ years of Technology Infrastructure Engineering and Solutions experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
5+ years of experience using Observability Tools with hands-on implementation of monitoring, logging, or tracing solutions utilizing Grafana, ThousandEyes, Prometheus, AppDynamics, or Splunk
3+ years of application production support experience in complex, high-availability environments
2+ years of experience with Confluence or Jira

Nice To Haves

Experience with Site Reliability Engineering (SRE), including SLO/SLI frameworks, error budgets, toil reduction, and production reliability engineering practices
Experience with database logging and monitoring concepts
Experience with application performance monitoring and optimization using BlazeMeter, JMeter, Splunk, AppDynamics, or similar observability platforms
Experience with scripting or programming languages such as Bash, PowerShell, Python, Shell, VBScript, or JavaScript for automation and reliability engineering use cases
Experience with and understanding of AIOps and related tools such as Moogsoft or BigPanda, including event correlation and noise reduction
Experience with one or more automation tools such as Ansible or similar infrastructure-as-code/configuration management tools
Experience with container technologies (Kubernetes, Docker, PKS), with a focus on observability and reliability patterns in distributed systems

Responsibilities

Drive and lead Site Reliability Engineering capabilities at Wells Fargo Banking Operations, igniting the practice, principles, and culture, and leading by example
Mentor and coach engineers while scaling the SRE practice within Banking Operations and partnering with peer platform embedded SRE teams
Leverage enterprise capabilities, tools, and innovation to improve availability in a complex ecosystem by maturing observability practices, including monitoring, logging, distributed tracing, synthetic monitoring, and chaos engineering, with a focus on actionable insights and proactive detection
Lead the evolution of our environment by introducing self-healing and autonomic capabilities and solving complex operational and systemic issues with precision, including building and training models, automating cognitive processes, and leveraging telemetry to improve the availability and reliability of the products we provide to customers
Own and automate key SRE metrics and IT Service Operations processes, including customer impact, golden signals and critical user journeys, percent availability of critical business flows, SLO/SLI definition and adherence, error budget management, and real-time observability dashboards; automate incident response processes through data integration with unified communications and alerting/notification systems
Provide leadership in support responsibilities for critical applications and customer journeys onboarded to SRE, including rapid remediation of issues through Agile practices, conducting blameless postmortems, driving root cause analysis, and implementing durable solutions through continuous improvement, with the goal of eliminating repeat incidents