Site Reliability Engineer Lead

Bank of America•Plano, TX

65d•Onsite

About The Position

Bank of America is guided by a common purpose: to help make financial lives better through responsible growth, delivering for clients, teammates, communities, and shareholders. Being a Great Place to Work is central to that purpose, which includes fostering an inclusive workplace, attracting and developing talent, supporting wellness, recognizing performance, and making an impact in communities. Bank of America maintains an in-office culture with specific attendance requirements and flexibility based on role.

This role is responsible for building and leading a team to deliver technology products and services that meet business outcomes. Key responsibilities include developing a technology strategy, ensuring compliance with standards, promoting design, engineering, and organizational practices, and advocating for modern, Agile solution delivery. The role may also involve coaching, mentoring, providing feedback, supporting career development, identifying emerging talent, fostering leadership skills, and managing stakeholders.

The position calls for a seasoned Site Reliability Engineering (SRE) leader to drive the reliability, scalability, and performance of critical Infrastructure Automation platforms. This leader will design and implement SRE practices across a federated technology ecosystem, ensuring operational excellence through automation, observability, and resilient architecture. The ideal candidate has deep expertise in distributed systems, cloud-native infrastructure, SaaS application support, and DevOps/SRE principles, along with strong leadership and collaboration skills to influence cross-functional engineering and Production management teams and drive continuous improvement in service reliability.

Requirements

10+ years of experience in systems engineering, DevOps, or SRE roles in large-scale environments.
Deep understanding of Linux/Unix and Windows systems, networking, and distributed computing.
Proven experience with observability stacks (e.g., Dynatrace, Grafana, Splunk, OpenTelemetry).
Expertise in infrastructure-as-code and automation tools (e.g., Terraform, Ansible, Python).
Strong knowledge of cloud platforms and container orchestration (Kubernetes).
Demonstrated success in leading incident response and driving systemic improvements.
Experience with capacity planning, performance tuning, and cost optimization.
Excellent communication and stakeholder management skills, including executive engagement.

Nice To Haves

Experience with ITIL/ITSM processes and integration with platforms like ServiceNow.
Familiarity with security and compliance in regulated industries (e.g., financial services).
Background in performance engineering and infrastructure analytics.
Experience developing dashboards and metrics for operational health and reliability.

Responsibilities

Define and implement SRE frameworks, including SLIs/SLOs/SLAs, error budgets, and incident response protocols.
Establish governance models for reliability engineering across distributed teams.
Champion a culture of observability, proactive monitoring, and continuous feedback loops.
Lead root cause analysis (RCA) and post-incident reviews to identify systemic issues and prevent recurrence.
Implement proactive problem detection using telemetry, anomaly detection, and trend analysis.
Collaborate with engineering and operations teams to eliminate toil and reduce incident frequency and impact.
Develop and maintain capacity models to ensure systems scale efficiently with business demand.
Monitor performance trends and lead optimization efforts across infrastructure and applications.
Partner with finance and engineering teams to align capacity planning with cost and growth objectives.
Drive automation of operational tasks including deployments, scaling, and recovery.
Integrate reliability tooling with CI/CD pipelines, ITSM platforms (e.g., ServiceNow), and observability systems.
Oversee major incident response, escalation, and communication processes.
Develop and maintain runbooks, playbooks, and escalation protocols.
Drive continuous improvement through blameless retrospectives and operational reviews.
Serve as a senior technical advisor and thought leader in SRE and platform engineering.
Mentor and guide SRE teams and partner with engineering leaders across the enterprise.
Provide input on staffing, tooling strategy, and budget planning for reliability initiatives.
Model an inclusive environment for employees and clients, aligned to the company's Great Place to Work goals.
Demonstrate deep process knowledge, operational excellence, and innovation through a focus on simplicity, data-based decision making, and continuous improvement.
Communicate enterprise decisions, purpose, and results, and connect them to team strategy, priorities, and contributions.
Ensure proper risk discipline, controls, and culture are in place to identify, escalate, and debate issues.
Provide inspection, coaching, and feedback to motivate, differentiate, and improve performance.
Actively manage expenses and budgets in alignment with objectives, making sound financial decisions.
Assess talent and build bench strength for roles across the organization.
Deliver results by effectively prioritizing, inspecting, and appropriately delegating team work.