Site Reliability Engineer

Bank of America
Charlotte, NC
Onsite

About The Position

At Bank of America, we are guided by a common purpose to help make financial lives better through the power of every connection. We do this by driving Responsible Growth and delivering for our clients, teammates, communities and shareholders every day.

Being a Great Place to Work is core to how we drive Responsible Growth. This includes our commitment to being an inclusive workplace, attracting and developing exceptional talent, supporting our teammates’ physical, emotional, and financial wellness, recognizing and rewarding performance, and making an impact in the communities we serve.

Bank of America is committed to an in-office culture with specific requirements for office-based attendance, while allowing an appropriate level of flexibility for our teammates and businesses based on role-specific considerations.

At Bank of America, you can build a successful career with opportunities to learn, grow, and make an impact. Join us!

Requirements

  • 5+ years of experience designing and implementing scalable systems
  • Strong knowledge of Linux/Unix systems and command line tools
  • Proficiency in scripting languages such as Python, Shell, or Perl
  • Experience with application performance monitoring (APM) tools such as Dynatrace
  • Familiarity with cloud platforms like AWS, Azure, or Google Cloud
  • Understanding of networking principles and protocols (TCP/IP, HTTP, DNS, etc.)
  • Knowledge of containerization and orchestration technologies (Docker, Kubernetes)
  • Knowledge of monitoring and logging tools such as Prometheus, Grafana, ELK stack, or Splunk
  • Strong problem-solving and troubleshooting skills, with the ability to analyze and resolve complex technical issues
  • Excellent communication and collaboration skills to work effectively with cross-functional teams
  • Strong attention to detail and ability to work in a fast-paced, dynamic environment

Nice To Haves

  • Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience)
  • Expertise in APM tools such as Dynatrace
  • Experience in network packet capture analysis with Wireshark
  • Skills: Adaptability, Analytical Thinking, Influence, Production Support, Risk Management, Automation, Collaboration, Innovative Thinking, Result Orientation, Solution Design, Business Acumen, DevOps Practices, Project Management, Solution Delivery Process, Stakeholder Management

Responsibilities

  • Design and implement highly available and scalable systems, ensuring the reliability and performance of the company's website or application
  • Collaborate with cross-functional teams to define and establish service level objectives (SLOs) and service level agreements (SLAs) for critical systems
  • Monitor systems and applications, proactively identifying and resolving any performance bottlenecks or availability issues
  • Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance
  • Conduct post-incident analyses to identify root causes and implement preventive measures to avoid future incidents
  • Automate repetitive tasks and processes to improve efficiency and reduce manual intervention
  • Create and maintain documentation for system architecture, configuration, and troubleshooting procedures
  • Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability and performance standards
  • Stay up to date with industry best practices, new technologies, and emerging trends in site reliability engineering