SRE Application Support Engineer - AWS

Morgan Stanley • New York, NY

About The Position

In the Technology division, we leverage innovation to build the connections and capabilities that power our Firm, enabling our clients and colleagues to redefine markets and shape the future of our communities. This is a Lead Software Production Management & Reliability Engineering position at Director level, part of the job family responsible for overseeing the production environment, ensuring the operational reliability of deployed software, and implementing strategies to optimize performance and minimize downtime. Since 1935, Morgan Stanley has been known as a global leader in financial services, always evolving and innovating to better serve our clients and our communities in more than 40 countries around the world.

Department Profile

Morgan Stanley Wealth Management (MSWM) Technology is the global technology department responsible for the design, development, delivery and support of the technical solutions behind the products and services used by the Morgan Stanley Wealth Management (MSWM) business. The department comprises 10 organizations: Sales, Banking & Corporate-Client Technology; Investment Products & Markets Technology; Client Reporting; Core Processing; Private and International Wealth Management Technology; Technology Integration Office; Enterprise Infrastructure & Production Management; Capital Markets Application & Data Services; Deployment Planning & Release Management; and the Chief Operating Office.

The Reliability Operations (RO) team within WMT is responsible for providing swift, courteous, and knowledgeable customer service to end users of the production systems. This position is focused on user and systems support: answering hotline calls, monitoring system alerts, and taking corrective action. Technical understanding is important, as is the ability to speak with users and understand their problems. In addition to direct user support, the team performs infrastructure-related tasks including process configuration, hardware capacity planning, event management, release work, and support tool development, so that repetitive tasks are packaged in a way that reduces risk.

Position Overview

The Wealth Management Production Management Site Reliability Engineer position is a highly visible, critical role on a team of technical SMEs managing the stability and optimization of the Wealth Management systems. Scope includes, but is not limited to, day-to-day support of the organization's technology-related outages and collaboration on technology projects focused on stability, optimization, business impact analysis, and associated risk-related methodologies. This role will be responsible for the overall stability of the Wealth Management Investment Management application platforms, participation in key optimization initiatives, and collaboration with multiple technical teams within Morgan Stanley. Additionally, the role partners with WM business units and various levels of management and staff to collect and analyze information and make recommendations on optimizing the platform.

As a team member with expertise in deep analytical triage, you will provide subject matter expertise in debugging, issue analysis and troubleshooting, working with business and technical colleagues to provide reviews and recommendations that help avoid future application issues. You will also produce guidance documentation, standards and procedures, product assessments, and training material, and work with the various application and infrastructure support teams to ensure that they document every troubleshooting step in the Morgan Stanley knowledge base system so that issues can be resolved faster. You will serve as a fully seasoned, proficient technical resource, providing technical knowledge in outage management and proactive solutions to improve the user experience.

This position will mainly perform a DevOps/SRE role in application support, platform stability, and resiliency.

Requirements

Minimum 5–7 years of experience developing and/or supporting enterprise applications
3–5 years of experience leading a small-to-medium team with similar skill sets
Willingness to embrace Agile practices and DevOps/SRE concepts
Working knowledge of DevOps and observability tools (e.g., Grafana, Prometheus, Splunk, Kibana)
Strong analytical skills, including problem determination and recovery/resolution processes
Ability to interface with and build strong working relationships across technology teams, business analysts, and vendors
Understanding of database engineering and the ability to develop high-quality database solutions
Experience with AWS services such as EC2, ECS, S3, Fargate, Aurora, and Lambda
Hands-on scripting experience with Python and working knowledge of Java
Administrative competence in at least one major programming language or platform (e.g., Perl, PowerShell, Unix scripting, Java, C#, .NET)
Exposure to application technologies such as JavaScript, React, GraphQL, Python, Django, Celery, PostgreSQL, Golang, Elasticsearch, RabbitMQ, and Kafka
Fast learner who thrives in a fast-paced environment
Strong organizational skills and the ability to manage multiple tasks and high-pressure outage situations through resolution
Driven to learn new technologies and techniques and to be an integral member of the team
Hands-on experience administering large-scale, high-availability systems and using tools to monitor performance and availability
Experience in creating technical architecture documentation
Excellent written and verbal communication skills for technical discussions across management layers
BS/MS degree (or equivalent), preferably in a quantitative discipline (e.g., Computer Science, Computer Engineering)
Experience with on-call support and the ability to respond to emergencies on a 24/7 basis

Nice To Haves

Experience with web analytics tools (preferably Adobe Experience Cloud) is a plus
Experience in the financial services industry is a plus
5–10 years of experience supporting or developing transaction-based systems
Hands-on technical professional who understands both code and infrastructure
Proven operational/support background with strong understanding of incident, problem, and change management to drive stability across organizations
Ability to lead outage incidents by coordinating cross-team response and user communications through resolution
Strong focus on metrics, monitoring, and trend analysis
Strong problem-solving skills with the ability to analyze and interpret data
Ability to build strong relationships and coordinate effectively with multiple parties during outages, while providing clear updates to APG and BU partners
Comfortable with an on-call rotation, including weekend work
Strong end-user support skills; able to partner with users to diagnose issues and drive to resolution
Self-motivated with excellent written and verbal communication; able to communicate clearly and concisely
Strong ownership mentality with a focus on customer satisfaction
Detail-oriented and well organized, with strong analytical skills
Experience working in a virtual and/or global team environment
Self-starter with a “can-do” attitude and the ability to multitask effectively
Familiarity with ITIL concepts, especially incident and problem management

Responsibilities

Proactively detect, troubleshoot, and resolve all issues affecting production applications. This involves coordination with and escalation to development and external teams where necessary. The team owns every issue escalated to it until the issue is resolved or a workaround is provided that lets end users continue working.
Responsible for maintaining clear, concise, and timely communications with affected parties during the investigation and resolution of any individual or system-wide outage
Responsible for the stability of the Production environment
Develop and continually revise (in partnership with other teams where necessary) suitable policies and procedures to ensure appropriate application development standards are available to guide development for systems deployed to Production.
As the gatekeepers of the Production environment, responsible for ensuring the Change Implementation Management guidelines/policies are adhered to for all systems deployed to Production.
Responsible for servicing all requests for data or other activities that require access to Production systems
Work with development teams at the appropriate stages in application development to ensure any new systems or projects meet the Production standard
Responsible for maintaining and growing a body of knowledge that is accessible to all team members. Ensure information regarding any support-related activities or issues is available and easily accessible. The goal is to improve self-reliance and reduce dependency on the availability of development or external team resources for the initial troubleshooting and resolution of problems.
