Disaster Recovery and Major Incident Response Manager

Hospital for Special Surgery • New York, NY

About The Position

How you move is why we’re here.® Now more than ever. Get back to what you need and love to do. The possibilities are endless...

Now more than ever, our guiding principles are helping us in our search for exceptional talent: candidates who align with our unique workplace culture and who want to maximize the abundant opportunities for growth and success. If this describes you, then let’s talk!

HSS is consistently among the top-ranked hospitals for orthopedics and rheumatology by U.S. News & World Report. As a recipient of the Magnet Award for Nursing Excellence, HSS was the first hospital in New York City to receive the distinguished designation. Whether you are early in your career or an expert in your field, you will find HSS an innovative, supportive and inclusive environment. Working with colleagues who love what they do and are deeply committed to our Mission, you too can be part of our transformation across the enterprise.

Requirements

Bachelor’s degree in Computer Science, Information Technology, Business Administration, or a related field.
8+ years of experience in IT operations, major incident management, disaster recovery, or service management.
Strong demonstrated experience acting in an incident command or major incident leadership role.
Strong understanding of disaster recovery and business continuity concepts, application tiering and dependency management, and infrastructure and application recovery strategies.
Proven ability to lead cross‑functional teams during high‑pressure incidents.
Effective communication skills and executive presence.
This role holds operational command responsibility during declared disasters and major IT incidents.
After-hours, weekend, and holiday availability is an essential function of the role.
Success requires the ability to drive execution, accountability, and remediation through influence across multiple teams.

Nice To Haves

Experience in healthcare, regulated, or audit‑driven environments.
Familiarity with ITIL, ISO 22301, NIST, or similar frameworks.
Experience leading and supporting large, complex application portfolios or programs.
Experience working with third‑party vendors during recovery events.

Responsibilities

Serve as Active or Backup Responder on Duty (AROD/BROD) on a scheduled rotation for major incidents, declared disasters, and extended outages.
Function as the single point of command and escalation during DR and major incidents in coordination with the Executive on Duty (EOD).
Assess incident severity and determine when to escalate to disaster recovery activation in collaboration with IT EOD and business leadership.
Coordinate cross‑functional response efforts involving infrastructure, application teams, cybersecurity, vendors, and business operations.
Lead real‑time incident coordination calls and ensure clear task assignment, escalation, and decision tracking.
Oversee internal and external communications related to service outages, recovery progress, and restoration status.
Ensure continuous 24/7 readiness by validating that playbooks, contact lists, tooling access, vendor support, and escalation paths are current, accessible, and executable at all times.
Ensure effective shift handoffs, documentation continuity, and leadership coverage during prolonged or multi‑day incidents.
Own the activation and execution of disaster recovery plans and runbooks during declared events.
Coordinate technical recovery activities across infrastructure, platform, and application teams.
Ensure application recovery is validated by appropriate application and business owners prior to declaring service restoration.
Maintain operational oversight for prolonged recovery efforts, including shift coverage, resource planning, and vendor engagement.
Ensure recovery actions are executed in accordance with approved DR standards, policies, and tiering requirements.
Partner with the DR/BC Governance function to maintain enterprise DR readiness across all application tiers.
Own the creation, maintenance, and continuous improvement of disaster recovery and major incident playbooks to ensure they are present for all in‑scope applications, technically accurate and executable, and reviewed and validated on a defined cadence.
Partner with IT Operations, Infrastructure, Applications, and Cybersecurity teams to validate technical accuracy and operational effectiveness of disaster recovery and major incident playbooks.
Support application tiering decisions and ensure recovery strategies align to business impact and risk tolerance.
Lead the planning, execution, and facilitation of disaster recovery testing, tabletop exercises, and simulations; ensure findings are documented and tracked to closure.
Ensure exercise outcomes, identified gaps, and remediation actions are documented, tracked, and resolved within defined timeframes.
Ensure DR processes align with internal policies, regulatory requirements, and audit expectations.
Lead the creation of Root Cause Analysis (RCA) documents and/or postmortem reviews following major incidents and disaster recovery events.
Ensure lessons learned, control gaps, and process improvements are documented and assigned to accountable owners.
Track remediation actions through completion and provide status updates to leadership and governance committees.
Identify recurring incident patterns or recovery risks and recommend corrective actions.
Develop and present actionable, data‑driven recommendations to IT and business leadership to improve disaster recovery readiness, response effectiveness, and operational resilience, including recovery strategy enhancements, tooling gaps, staffing models, and escalation processes.
Provide regular status updates and briefings to IT leadership, business partners, and governance committees.
Escalate recovery risks, resource constraints, or unresolved issues to executive leadership as appropriate.
Partner with business continuity leaders to ensure alignment between IT recovery and operational continuity procedures.