Director, Production Services Manager

BNY Mellon•New York, NY

10d•$130,000 - $281,000

About The Position

The Head of Production Services Governance, Incident & Problem Management is accountable for the enterprise governance, standards, and performance of Technology Incident Management and Problem Management (including root cause analysis) across BNY’s Platforms. This leader oversees a team that sets the operating model, drives consistent execution, improves the quality and speed of restoration, and strengthens auditability and regulatory credibility. The role is the senior point of accountability for: firm-wide incident/problem governance and ITIL-aligned standards; high-severity incident command and communications frameworks; end-to-end RCA quality and timeliness, including corrective/preventive actions; regulatory and client-facing incident narratives and responses; internal oversight engagement with groups such as ORR and ERO; and automation and AI augmentation to modernize and scale incident/problem practices. This position partners closely with engineering, SRE/operations, cyber, resiliency, risk, compliance, and business stakeholders to ensure stability, transparency, and continuous improvement of production services.

Requirements

10–15+ years in technology operations, SRE/production services, service management, or resiliency roles in complex enterprises; regulated financial services strongly preferred.
Demonstrated leadership in Major Incident Management and Problem Management/RCA at enterprise scale.
Strong command of ITIL practices (Incident, Problem, Monitoring & Event, Service Level, Change Enablement, Continual Improvement; familiarity with CMDB/Service Configuration is a plus).
Proven experience driving process standardization, operating model change, and measurable performance improvements (e.g., MTTR reduction, recurrence reduction).
Experience leading regulatory/audit-facing responses with strong evidence discipline and executive communication.

Nice To Haves

ITIL 4 Managing Professional (MP) and/or ITIL Strategic Leader (SL); ITIL Foundation at a minimum.
Familiarity with ISO/IEC 20000, NIST, and resiliency/operational risk expectations in financial services (helpful but not required).
Experience with AIOps platforms/observability tooling (e.g., event correlation, log analytics, tracing, anomaly detection).
Experience with Agile/DevOps/SRE operating models and integrating incident/problem practices into product/platform delivery.

Responsibilities

Own the Incident Management practice and ensure it is implemented consistently across Platform Production Services and aligned to ITIL principles.
Establish and maintain incident taxonomy, severity models, prioritization rules, escalation paths, and functional/organizational RACI.
Define the Major Incident Management (MIM) framework: incident command roles, war-room orchestration, communications cadence, stakeholder engagement, and decision rights.
Ensure end-to-end controls: accurate incident logging, categorization, impact assessment, timeline reconstruction, evidence retention, and closure criteria.
Drive performance through standard KPIs (e.g., MTTA/MTTR, reopen rate, SLA compliance, major incident frequency, customer-impact minutes, incident backlog health).
Own the Problem Management practice, including proactive problem identification, trending, and prevention of recurrence.
Establish RCA standards (methodologies such as 5 Whys, fishbone, fault tree, “cause–trigger–control gap” framing) and ensure consistent quality across teams.
Govern Corrective and Preventive Action (CAPA) management: remediation backlog, prioritization, due dates, owner accountability, and validation of effectiveness.
Maintain governance for Known Errors and Workarounds, enabling faster recovery and better knowledge reuse.
Drive systemic improvements by connecting incidents/problems to resiliency risks, architectural weaknesses, control gaps, and engineering quality.
Serve as accountable executive for regulatory responses and supervisory requests relating to incidents, outages, recovery actions, RCA findings, and resiliency improvements.
Lead firm readiness for time-sensitive regulatory deliverables, ensuring accuracy, consistency, and defensible evidence.
Coordinate and quality-assure client communications for impactful incidents (internal/external statements, timelines, cause, remediation, and prevention).
Provide clear executive narratives and materials for senior leadership, risk committees, audit committees, and business stakeholders.
Act as the primary interface to internal oversight groups (e.g., ORR, ERO, Operational Risk, Compliance, Internal Audit, and Technology Risk Management).
Ensure incidents/problems are appropriately mapped to relevant governance constructs (e.g., operational risk events where applicable) with clear traceability.
Lead continuous improvement of control coverage and evidence quality to support audits and examinations.
Partner with Resiliency teams to connect operational learning to scenario testing, dependency mapping, recovery planning, and service resiliency metrics.
Build and run a Quality Management System for incident/problem practices: sampling, assurance reviews, coaching, playbooks, and maturity assessments.
Develop and maintain standard artifacts (runbooks, major incident playbooks, comms templates, RCA templates, PIR guidance).
Run Continual Improvement programs: trend analysis, “top drivers” remediation themes, performance benchmarking, and maturity roadmaps.
Drive adoption of consistent tooling, workflows, and data standards across platforms.
Use AI responsibly to improve speed, quality, and scale of incident/problem management while meeting security, privacy, and model-risk expectations.
Lead and develop a high-performing team of incident/problem governance professionals (e.g., problem managers, automation analysts).
Establish role clarity, training paths, and ITIL-aligned capability development.
Foster a culture of calm, disciplined execution during crises and a post-incident learning culture focused on prevention, not blame.