Staff Platform Resilience Event Manager

APEX Fintech Services • Austin, TX

Hybrid

About The Position

The Staff Platform Resilience Event Manager is responsible for the strategic planning, coordination, and execution of platform resilience events across our technology ecosystem. This includes game day events, disaster recovery testing, business continuity exercises, vendor disaster recovery coordination, and regulatory-driven resilience demonstrations. This role serves as the central orchestrator across the organization to ensure our resilience posture is continuously validated, documented, and improved. You will transform resilience testing into a strategic capability that demonstrates enterprise operational maturity. This is not a traditional event planning role: it requires a deep understanding of distributed systems, financial services regulations, incident command structures, and cross-functional program management in a high-stakes environment.

Requirements

Bachelor's degree in a technical field (or equivalent work experience) required
10+ years in technology operations, site reliability engineering (SRE), DevOps, or infrastructure roles
3+ years in financial services technology (preferably broker-dealer, clearing, custody, or payments)
Hands-on experience with disaster recovery planning and execution in complex, distributed systems environments
Experience supporting regulatory examinations and producing compliance documentation
Financial Services & Regulatory Knowledge
Working knowledge of FINRA, SEC, and financial services regulatory requirements for business continuity and disaster recovery
Understanding of third-party risk management in regulated environments
Technical Knowledge and Program & Project Management
Understanding of cloud infrastructure (AWS/Azure/GCP), database failover, load balancing, and multi-region architectures
Familiarity with incident command systems, runbook automation, and monitoring/observability platforms
Proven ability to manage complex, cross-functional programs with multiple stakeholders and competing priorities
Experience leading high-stakes, time-sensitive events requiring real-time coordination and decision-making
Strong project management skills: planning, scheduling, resource coordination, status reporting
Comfort with ambiguity and ability to build new programs from the ground up
Communication & Leadership
Executive presence: able to brief C-suite and board on resilience posture and event outcomes
Exceptional written communication: producing regulatory reports, audit evidence, executive summaries
Incident command or crisis management experience preferred
Ability to influence without authority across technical and non-technical teams

Nice To Haves

Certifications: CBCP (Certified Business Continuity Professional), CISSP, GCP/AWS/Azure certifications, ITIL
Chaos engineering experience (Chaos Monkey, Gremlin, etc.)
Background in internal audit, GRC, or compliance roles
Experience with tabletop exercises and red team/blue team scenarios

Responsibilities

Resilience Event Strategy & Annual Planning
Develop and maintain the annual Platform Resilience Event Calendar spanning all disaster recovery tests, business continuity exercises, game days, and vendor coordination events
Align event schedule with regulatory examination cycles, customer audit requests, and internal risk assessment priorities
Define success criteria and maturity progression for resilience events (e.g., tabletop → walkthrough → full failover → automated chaos)
Maintain our risk register with updates based on resilience event findings
Game Day & Chaos Engineering Program Leadership
Design and facilitate "game day" exercises that inject controlled failures into production or staging environments to validate system resilience
Partner with Engineering, SRE, Product and Ops teams to develop realistic failure scenarios (database outages, network partitions, dependency failures, traffic spikes)
Build game day playbooks, observer guides, and scoring rubrics to measure system and team response effectiveness
Evolve game day maturity from scheduled events to surprise/unannounced exercises (with appropriate stakeholder buy-in)
Vendor Disaster Recovery Coordination
Align and coordinate Platform participation in, and preparation for, DR/BC events alongside the Enterprise Risk team
In partnership with the Enterprise Risk team, organize and coordinate vendor-led DR tests, ensuring Apex participation and validation of vendor recovery capabilities
Ensure vendor DR documentation (runbooks, RTO/RPO commitments, contact lists) is current and accessible during incidents
Ensure an inventory of critical third-party vendors (cloud providers, tech vendors, service providers, SaaS/IaaS/PaaS services) and their contractual DR/BC obligations is maintained
Cross-Functional Stakeholder Alignment
Serve as the primary liaison between Platform and Compliance, Legal, Enterprise Risk Management, and Internal Audit
Integrate Security incident response scenarios into resilience events (e.g., ransomware recovery, insider threat)
Translate technical resilience outcomes into compliance artifacts, audit evidence, and regulatory examination responses
Coordinate with Legal on customer contractual obligations for DR demonstrations and availability SLAs
Regulatory, Audit and Client Contractual Readiness
Maintain compliance with FINRA Rule 4370 (Business Continuity Plans), SEC regulations, and state-level financial services resilience requirements
Produce request-ready documentation: game day results and findings, resilience metrics, and improvement tracking
Support regulatory examinations by providing examiner-requested evidence of resilience testing and improvement trends
Track and report findings
Metrics, Reporting & Continuous Improvement
Define and track key resilience metrics: RTO/RPO actuals, DR test success rates, mean time to failover, game day findings, vendor DR SLA compliance
Produce quarterly executive dashboards on resilience posture, event outcomes, and improvement initiatives
Maintain a centralized repository of runbooks, event after-action reports, and lessons learned
Drive continuous improvement by converting event findings into actionable engineering backlogs and process improvements
Benchmark Apex resilience maturity against industry standards (e.g., Gartner, NIST, financial services peers)
Incident Response Integration
Ensure resilience events validate and improve actual incident response capabilities (not just technical recovery)
Integrate platform events with ITSM Incident Management training to build muscle memory for real outages
Validate incident communication plans during events (customer notifications, executive escalations, status pages)
Use real incidents as inputs for future game day scenarios ("let's replay last quarter's outage in a controlled environment")

Benefits

healthcare benefits (medical, dental and vision, EAP)
competitive PTO
401k match
parental leave
HSA contribution match
paid subscription to the Calm app
generous external learning and tuition reimbursement benefits
hybrid work schedule for most roles that gives employees the flexibility of working from home and from one of our primary offices
