Site Reliability Engineer - Banking and Payments Director

Morgan Stanley • Alpharetta, GA

About The Position

We’re seeking a Site Reliability Engineer to join our team, preferably someone already working in the financial IT community. The position in the WM Prod Tech team is focused on delivering exceptional service to both BU and Dev partners to minimize or avoid production outages. The role centers on production support within WM Prod Tech: automating deployments and working with the agile teams to build and support stable, reliable production systems.

In the Technology division, we leverage innovation to build the connections and capabilities that power our Firm, enabling our clients and colleagues to redefine markets and shape the future of our communities. This is a Lead Software Production Management & Reliability Engineering position at the Director level, part of the job family responsible for overseeing the production environment, ensuring the operational reliability of deployed software, and implementing strategies to optimize performance and minimize downtime.

Since 1935, Morgan Stanley has been known as a global leader in financial services, continuously evolving and innovating to better serve our clients and our communities in more than 40 countries around the world.

Requirements

5+ years of experience in a production environment with a solid software development background and understanding of performance tuning, end-to-end troubleshooting, networking fundamentals and appropriate attention to detail.
Ability to stay focused and resolve production issues in a demanding, high-pressure environment
Hands-on experience in application and database troubleshooting/issue resolution in a fast-paced environment
Automation experience using at least one of the following scripting languages: Python, Perl, or shell scripting.
Strong experience in Continuous Integration and Continuous Deployment
Strong experience with on-demand environments for both virtual machines and containers
Strong database skills with Sybase, Oracle, or DB2.
Hands-on experience with Linux/UNIX
Hands-on experience with Perl/Java
Practical experience with Agile methodology (e.g., Scrum).
Awareness of, and ability to reason about, modern software and systems architectures, including load balancing, queueing, caching, distributed systems failure modes, microservices, and cloud.
Excellent communication skills and the ability to think outside the box for process improvements.
Bachelor's/Master's Degree in Computer Science, Information Systems, or related field.

Nice To Haves

Knowledge of Retail Banking and Payments.
Knowledge of cloud-based deployment, security, and networking concepts in Azure and AWS.
Knowledge of Control-M or other batch scheduling software.
Experience in Continuous Integration and Continuous Deployment.
Knowledge of and hands-on experience with monitoring tools such as Kibana, Loki, and Grafana.
Knowledge of or experience with automating deployments using Jenkins, Train, or Windeploy
Interest in designing, analyzing, and troubleshooting large-scale distributed systems.

Responsibilities

Work closely with support/development teams to design, build, and maintain systems
Troubleshoot both non-prod and production issues across the entire stack: hardware, software, application, and network
Identify and drive opportunities to improve automation for the company; scope and create automation for deployment, management, and visibility of our services
Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity; this includes automation for various other operational needs
Work with upstream data providers and downstream consumers, and reduce the amount of escalation to development teams
Represent the SRE organization in design reviews and operational readiness exercises for new and existing services
Use analytical skills to find trends in the environment and drive out problems.
Help design and implement telemetry and statistics gathering to locate areas of the plant where effort needs to be focused to make improvements.
Maintain applications once they are live by measuring and monitoring availability, latency, and overall system health with a focus on business activities, and continuously evaluate cost and waste.
Work closely with Application Development to ensure that the support team has excellent knowledge of the application set; own and maintain the support knowledge base and documentation.
Be flexible in covering the weekend on-call rotation and attending calls with team members in other time zones.
Develop scripts and assist with code changes along with operational tasks/activities
Take ownership of and manage production requests, questions, and issues, and perform root cause analysis for outages and incidents
Understand the overall business flow of supported application systems and their interfaces with clients