Site Reliability Engineer Lead

Bank of America•Plano, TX

Onsite

About The Position

At Bank of America, we are guided by a common purpose to help make financial lives better through the power of every connection. We do this by driving Responsible Growth and delivering for our clients, teammates, communities and shareholders every day. Being a Great Place to Work is core to how we drive Responsible Growth. This includes our commitment to being an inclusive workplace, attracting and developing exceptional talent, supporting our teammates’ physical, emotional, and financial wellness, recognizing and rewarding performance, and how we make an impact in the communities we serve.

Bank of America is committed to an in-office culture with specific requirements for office-based attendance, while allowing an appropriate level of flexibility for our teammates and businesses based on role-specific considerations. At Bank of America, you can build a successful career with opportunities to learn, grow, and make an impact.

This job is responsible for building and leading a team to deliver technology products and services that meet business outcomes. Key responsibilities include developing a technology strategy, ensuring technology solutions comply with applicable standards, promoting design, engineering, and organizational practices, and advocating and advancing modern, Agile solution delivery practices. Job expectations may include coaching, mentoring, providing feedback and hands-on career development, identifying emerging talent, fostering leadership skills, and managing stakeholders.

We are seeking a seasoned Site Reliability Engineering (SRE) Leader to drive the reliability, scalability, and performance of critical Infrastructure Automation platforms. This role will lead the design and implementation of SRE practices across a federated technology ecosystem, ensuring operational excellence through automation, observability, and resilient architecture. The ideal candidate will bring deep expertise in distributed systems, cloud-native infrastructure, SaaS application support, and DevOps/SRE principles, along with strong leadership and collaboration skills to influence cross-functional engineering and Production management teams and drive continuous improvement in service reliability.

Requirements

10+ years of experience in systems engineering, DevOps, or SRE roles in large-scale environments.
Deep understanding of Linux/Unix & Windows systems, networking, and distributed computing.
Proven experience with observability stacks (e.g., Dynatrace, Grafana, Splunk, OpenTelemetry).
Expertise in infrastructure-as-code and automation tools (e.g., Terraform, Ansible, Python).
Strong knowledge of cloud platforms and container orchestration (Kubernetes).
Demonstrated success in leading incident response and driving systemic improvements.
Experience with capacity planning, performance tuning, and cost optimization.
Excellent communication and stakeholder management skills, including executive engagement.

Nice To Haves

Experience with ITIL/ITSM processes and integration with platforms like ServiceNow.
Familiarity with security and compliance in regulated industries (e.g., financial services).
Background in performance engineering and infrastructure analytics.
Experience developing dashboards and metrics for operational health and reliability.

Responsibilities

Define and implement SRE frameworks, including SLIs/SLOs/SLAs, error budgets, and incident response protocols.
Establish governance models for reliability engineering across distributed teams.
Champion a culture of observability, proactive monitoring, and continuous feedback loops.
Lead root cause analysis (RCA) and post-incident reviews to identify systemic issues and prevent recurrence.
Implement proactive problem detection using telemetry, anomaly detection, and trend analysis.
Collaborate with engineering and operations teams to eliminate toil and reduce incident frequency and impact.
Develop and maintain capacity models to ensure systems scale efficiently with business demand.
Monitor performance trends and lead optimization efforts across infrastructure and applications.
Partner with finance and engineering teams to align capacity planning with cost and growth objectives.
Drive automation of operational tasks including deployments, scaling, and recovery.
Integrate reliability tooling with CI/CD pipelines, ITSM platforms (e.g., ServiceNow), and observability systems.
Oversee major incident response, escalation, and communication processes.
Develop and maintain runbooks, playbooks, and escalation protocols.
Drive continuous improvement through blameless retrospectives and operational reviews.
Serve as a senior technical advisor and thought leader in SRE and platform engineering.
Mentor and guide SRE teams and partner with engineering leaders across the enterprise.
Provide input on staffing, tooling strategy, and budget planning for reliability initiatives.
Model an inclusive environment for employees and clients, aligned to company Great Place to Work goals.
Demonstrate deep process knowledge, operational excellence, and innovation through a focus on simplicity, data-based decision making, and continuous improvement.
Communicate enterprise decisions, purpose, and results, and connect them to team strategy, priorities, and contributions.
Ensure proper risk discipline, controls, and culture are in place to identify, escalate, and debate issues.
Provide inspection, coaching, and feedback to motivate, differentiate, and improve performance.
Actively manage expenses and budgets in alignment with objectives, making sound financial decisions.
Assess talent and build bench strength for roles across the organization.
Deliver results by effectively prioritizing, inspecting, and appropriately delegating the team's work.