Site Reliability Engineer II

Bank of America•Charlotte, NC

4d•Onsite

About The Position

At Bank of America, we are guided by a common purpose to help make financial lives better through the power of every connection. We do this by driving Responsible Growth and delivering for our clients, teammates, communities and shareholders every day. Being a Great Place to Work is core to how we drive Responsible Growth. This includes our commitment to being an inclusive workplace, attracting and developing exceptional talent, supporting our teammates’ physical, emotional, and financial wellness, recognizing and rewarding performance, and how we make an impact in the communities we serve. Bank of America is committed to an in-office culture with specific requirements for office-based attendance, which allows for an appropriate level of flexibility for our teammates and businesses based on role-specific considerations. At Bank of America, you can build a successful career with opportunities to learn, grow, and make an impact. Join us!

This job is responsible for partnering with engineering and technology teams to implement measures as prescribed by lead/senior SREs. Key responsibilities include ensuring appropriate instrumentation, tooling, ticketing, alerting and on-call routines are in place for key services, identifying root causes of issues through production triage efforts, and suggesting code enhancements to technology teams to automate services and improve reliability and efficiency. Job expectations include using software development skills to improve efficiency and to address gaps in reliability.

The Site Reliability Engineer (SRE) focuses on building, maintaining, and improving the reliability, scalability, and performance of cloud infrastructure using Infrastructure as Code (IaC) and Terraform Enterprise. The role supports the delivery of secure, compliant, and highly available cloud environments aligned with enterprise standards and regulatory requirements.

Requirements

5+ years of experience in platform, systems, or infrastructure engineering, with a strong focus on automation and integration.
Proficiency in SRE best practices, with a proven ability to reduce toil and improve observability of the environment.
Experience with automation and orchestration tools (e.g., Ansible or similar), and scripting with Go, Python, or equivalent.
Experience supporting enterprise service mesh platforms.
Experience with Infrastructure as Code (IaC) concepts and CI/CD pipelines supporting automated builds, validation, and deployments.
Experience integrating provisioning workflows with platform services such as virtualization, networking, identity, monitoring, and configuration management systems.
Strong focus on testing and reliability, including automated integration/validation testing and troubleshooting of complex workflows.

Nice To Haves

Experience with CI/CD pipelines and infrastructure-as-code (e.g., Terraform, Ansible)
Familiarity with containerization and orchestration platforms (e.g., Kubernetes)
Experience in financial services or highly regulated environments
Strong analytical and problem-solving skills with data-driven decision making
Ability to communicate complex technical concepts to non-technical stakeholders
Prior experience mentoring or leading engineering teams

Responsibilities

Develops and maintains a catalog of extensible reliability scripts, tools and libraries that can be leveraged for common instrumentation, automation, and operational needs, and mentors Site Reliability Engineer (SRE) resources on reliability practices and established tools/capabilities
Collaborates with Development and Infrastructure teams to understand technical solutions and implement monitoring capabilities outlined in the application and system monitoring designs put forward by the SRE Lead
Partners with infrastructure engineers and application teams to implement the code changes needed to make use of common reliability libraries and tools, and helps Application Production Services (APS) and Application Development teammates understand how to use them
Identifies vulnerabilities and opportunities for reliability improvement, such as investigating low-level error rates and 'noise' in monitoring, and defines solutions to reduce manual support effort and/or improve system reliability
Engages as a subject matter expert (SME) in major incident triage efforts and failure scenario modelling, and works with the Problem Manager to diagnose root causes for major incident/problem management investigations
Participates regularly in an on-call rotation with Production Support teammates to learn more about reliability issues affecting their portfolio