Corporate Vice President - Lead Site Reliability Engineer

New York Life•New York, NY

52d•Hybrid

About The Position

We are seeking a highly skilled site reliability engineer (SRE) to join our IT Operations team. The SRE role is responsible for enabling innovation and velocity of change while ensuring system reliability, focusing on the critical features and functionality within products and platforms. The SRE collaborates with business and product owners to prioritize operational requirements by defining service-level indicators (SLIs) and service-level objectives (SLOs) that monitor and optimize the customer journey and experience. Our goal is to improve the stability of existing platforms and, in parallel, to design and operate scalable, resilient systems using modern software engineering principles.
In this role you will analyze service management, incident management, problem management, change management, and release management data to identify persistent problems. You will then improve monitoring and observability and implement corrective actions. You are also encouraged to recommend changes to our architecture to increase performance and stability. Successful reliability outcomes are likely to build on DevOps and Agile ways of working and their associated automation approaches. These are underpinned by the site reliability engineer’s solid understanding of systems, production environments, operational insights, and incident management across on-premises, cloud, and hybrid environments.
The nature of the work means that the site reliability engineer will engage directly with customer teams and will also work on reliability initiatives that span multiple teams. The site reliability engineer collaborates closely with product owners and teams, architects, IT service management, software developers, security and network engineers, and other subject matter experts and roles, particularly in infrastructure and operations.
Being an approachable team player and a good communicator is therefore crucial for success, and a willingness to lead initiatives is important. The site reliability engineer leads root cause analysis in areas such as deployment activities, event management, incident and problem management, availability, capacity and service-level management, as well as service continuity and scalability.

Requirements

Education: Bachelor's degree in Information Technology, Computer Science, or a related field
Experience: 3+ years in software engineering, DevOps, SRE, or related disciplines
Essential: AWS certification; experience supporting Salesforce and Salesforce integrations; strong programming skills (Java, JavaScript, SQL, API development); experience with Terraform and infrastructure-as-code; knowledge of SLIs/SLOs, observability, and performance metrics
Strong technical expertise in network infrastructure and platform support.
Excellent analytical and problem-solving abilities.
Proven ability to manage high-pressure situations and resolve complex issues.
In-depth knowledge of network protocols and services (TCP/IP, DNS, DHCP, VPN).
Proficiency in using network monitoring and troubleshooting tools (e.g., Wireshark, SolarWinds, Nagios).
Experience with various server operating systems (Windows, Linux, AMI) and cloud platforms (AWS, Azure).
Strong communication and interpersonal skills.
Ability to work independently and as part of a team.
Commitment to continuous learning and improvement.

Responsibilities

Define and mature SRE practices, including SLO/SLI frameworks and error-budget governance.
Design and implement automation solutions using Java, JavaScript, APIs, SQL, and Terraform.
Investigate and resolve application performance bottlenecks by analyzing code, queries, APIs, and data flows.
Optimize data-processing pipelines, ETL components, and backend services for improved throughput and latency.
Deliver application-level fixes and enhancements through disciplined software engineering.
Focus on key reliability and performance indicators: uptime, system throughput, system output, and download rate/application load speed.
Partner with the NYL Platform Engineering Team to shift from non-standard application platforms to standard software artifacts (Terraform modules, secure base images, YAML templates, Java libraries) integrated into CI/CD pipelines, creating reusable patterns and reducing repetitive configuration and coding tasks.
Provide expert support and troubleshooting across network and enterprise service issues, ensuring minimal disruption to business operations.
Support various platforms, including Windows, Linux, macOS, and cloud environments (e.g., AWS, Azure).
Respond to and resolve incidents in a timely manner, providing clear communication to stakeholders throughout the process.
Build monitoring, observability dashboards, and alerting systems.
Monitor network and platform performance, proactively identify and address potential issues, and close gaps identified during troubleshooting efforts.
Maintain detailed records of issues, actions taken, and outcomes to support continuous improvement efforts.
Work closely with other IT teams and external vendors to resolve complex issues and implement solutions.
Identify opportunities to improve support processes and implement best practices to enhance overall efficiency.
Provide training and guidance to IT staff on network and platform support techniques and best practices.