Site Reliability Engineer

Bank of America•Charlotte, NC

Onsite

About The Position

At Bank of America, we are guided by a common purpose to help make financial lives better through the power of every connection. We do this by driving Responsible Growth and delivering for our clients, teammates, communities and shareholders every day. Being a Great Place to Work is core to how we drive Responsible Growth. This includes our commitment to being an inclusive workplace, attracting and developing exceptional talent, supporting our teammates’ physical, emotional, and financial wellness, recognizing and rewarding performance, and making an impact in the communities we serve.

Bank of America is committed to an in-office culture with specific requirements for office-based attendance, while allowing an appropriate level of flexibility for our teammates and businesses based on role-specific considerations. At Bank of America, you can build a successful career with opportunities to learn, grow, and make an impact. Join us!

Requirements

5+ years of experience designing and implementing scalable systems
Strong knowledge of Linux/Unix systems and command line tools
Proficiency in scripting languages such as Python, Shell, or Perl
Experience with APM tools such as Dynatrace
Familiarity with cloud platforms like AWS, Azure, or Google Cloud
Understanding of networking principles and protocols (TCP/IP, HTTP, DNS, etc.)
Knowledge of containerization technologies (Docker, Kubernetes) and orchestration tools
Knowledge of monitoring and logging tools such as Prometheus, Grafana, ELK stack, or Splunk
Strong problem-solving and troubleshooting skills, with the ability to analyze and resolve complex technical issues
Excellent communication and collaboration skills to work effectively with cross-functional teams
Strong attention to detail and ability to work in a fast-paced, dynamic environment

Nice To Haves

Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
Expertise in APM tools, e.g., Dynatrace
Experience in network packet capture analysis with Wireshark
Skills:
Adaptability
Analytical Thinking
Influence
Production Support
Risk Management
Automation
Collaboration
Innovative Thinking
Result Orientation
Solution Design
Business Acumen
DevOps Practices
Project Management
Solution Delivery Process
Stakeholder Management

Responsibilities

Design and implement highly available and scalable systems, ensuring the reliability and performance of the company's website or application
Collaborate with cross-functional teams to define and establish service level objectives (SLOs) and service level agreements (SLAs) for critical systems
Monitor systems and applications, proactively identifying and resolving any performance bottlenecks or availability issues
Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance
Conduct post-incident analyses to identify root causes and implement preventive measures to avoid future incidents
Automate repetitive tasks and processes to improve efficiency and reduce manual intervention
Create and maintain documentation for system architecture, configuration, and troubleshooting procedures
Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability and performance standards
Stay up to date with industry best practices, new technologies, and emerging trends in site reliability engineering
