Staff Platform Resilience Event Manager

APEX Fintech Services
Austin, TX
Hybrid

About The Position

The Staff Platform Resilience Event Manager is responsible for the strategic planning, coordination, and execution of platform resilience events across our technology ecosystem. These events include game days, disaster recovery testing, business continuity exercises, vendor disaster recovery coordination, and regulatory-driven resilience demonstrations. The role serves as the central orchestrator across the organization, ensuring our resilience posture is continuously validated, documented, and improved. You will transform resilience testing into a strategic capability that demonstrates enterprise operational maturity. This is not a traditional event planning role: it requires a deep understanding of distributed systems, financial services regulations, incident command structures, and cross-functional program management in a high-stakes environment.

Requirements

  • Bachelor's degree in a technical field (or equivalent work experience) required
  • 10+ years in technology operations, site reliability engineering (SRE), DevOps, or infrastructure roles
  • 3+ years in financial services technology (preferably broker-dealer, clearing, custody, or payments)
  • Hands-on experience with disaster recovery planning and execution in complex, distributed systems environments
  • Experience supporting regulatory examinations and producing compliance documentation
  • Financial Services & Regulatory Knowledge
      • Working knowledge of FINRA, SEC, and financial services regulatory requirements for business continuity and disaster recovery
      • Understanding of third-party risk management in regulated environments
  • Technical Knowledge
      • Understanding of cloud infrastructure (AWS/Azure/GCP), database failover, load balancing, and multi-region architectures
      • Familiarity with incident command systems, runbook automation, and monitoring/observability platforms
  • Program & Project Management
      • Proven ability to manage complex, cross-functional programs with multiple stakeholders and competing priorities
      • Experience leading high-stakes, time-sensitive events requiring real-time coordination and decision-making
      • Strong project management skills: planning, scheduling, resource coordination, status reporting
      • Comfort with ambiguity and ability to build new programs from the ground up
  • Communication & Leadership
      • Executive presence: able to brief the C-suite and board on resilience posture and event outcomes
      • Exceptional written communication: producing regulatory reports, audit evidence, and executive summaries
      • Incident command or crisis management experience preferred
      • Ability to influence without authority across technical and non-technical teams

Nice To Haves

  • Certifications: CBCP (Certified Business Continuity Professional), CISSP, GCP/AWS/Azure certifications, ITIL
  • Chaos engineering experience (Chaos Monkey, Gremlin, etc.)
  • Background in internal audit, GRC, or compliance roles
  • Experience with tabletop exercises and red team/blue team scenarios

Responsibilities

  • Resilience Event Strategy & Annual Planning
      • Develop and maintain the annual Platform Resilience Event Calendar spanning all disaster recovery tests, business continuity exercises, game days, and vendor coordination events
      • Align the event schedule with regulatory examination cycles, customer audit requests, and internal risk assessment priorities
      • Define success criteria and maturity progression for resilience events (e.g., tabletop → walkthrough → full failover → automated chaos)
      • Update our risk register based on resilience event findings
  • Game Day & Chaos Engineering Program Leadership
      • Design and facilitate game day exercises that inject controlled failures into production or staging environments to validate system resilience
      • Partner with Engineering, SRE, Product, and Ops teams to develop realistic failure scenarios (database outages, network partitions, dependency failures, traffic spikes)
      • Build game day playbooks, observer guides, and scoring rubrics to measure system and team response effectiveness
      • Evolve game day maturity from scheduled events to surprise/unannounced exercises (with appropriate stakeholder buy-in)
  • Vendor Disaster Recovery Coordination
      • Together with the Enterprise Risk team, align and coordinate Platform preparation for and participation in DR/BC events
      • In partnership with the Enterprise Risk team, organize and coordinate vendor-led DR tests, ensuring Apex participation and validation of vendor recovery capabilities
      • Ensure vendor DR documentation (runbooks, RTO/RPO commitments, contact lists) is current and accessible during incidents
      • Maintain an inventory of critical third-party vendors (cloud providers, tech vendors, service providers, SaaS/IaaS/PaaS services) and their contractual DR/BC obligations
  • Cross-Functional Stakeholder Alignment
      • Serve as primary liaison between Platform and Compliance, Legal, Enterprise Risk Management, and Internal Audit
      • Integrate Security incident response scenarios into resilience events (e.g., ransomware recovery, insider threat)
      • Translate technical resilience outcomes into compliance artifacts, audit evidence, and regulatory examination responses
      • Coordinate with Legal on customer contractual obligations for DR demonstrations and availability SLAs
  • Regulatory, Audit, and Client Contractual Readiness
      • Maintain compliance with FINRA Rule 4370 (Business Continuity Plans and Emergency Contact Information), SEC regulations, and state-level financial services resilience requirements
      • Produce request-ready documentation: game day results and findings, resilience metrics, and improvement tracking
      • Support regulatory examinations by providing examiner-requested evidence of resilience testing and improvement trends
      • Track and report regulatory, audit, and client findings
  • Metrics, Reporting & Continuous Improvement
      • Define and track key resilience metrics: RTO/RPO actuals, DR test success rates, mean time to failover, game day findings, vendor DR SLA compliance
      • Produce quarterly executive dashboards on resilience posture, event outcomes, and improvement initiatives
      • Maintain a centralized repository of runbooks, event after-action reports, and lessons learned
      • Drive continuous improvement by converting event findings into actionable engineering backlogs and process improvements
      • Benchmark Apex resilience maturity against industry standards and benchmarks (e.g., Gartner, NIST, financial services peers)
  • Incident Response Integration
      • Ensure resilience events validate and improve actual incident response capabilities (not just technical recovery)
      • Integrate platform events with ITSM Incident Management training to build muscle memory for real outages
      • Validate incident communication plans during events (customer notifications, executive escalations, status pages)
      • Use real incidents as inputs for future game day scenarios ("let's replay last quarter's outage in a controlled environment")

Benefits

  • Healthcare benefits (medical, dental, vision, and EAP)
  • Competitive PTO
  • 401(k) match
  • Parental leave
  • HSA contribution match
  • Paid subscription to the Calm app
  • Generous external learning and tuition reimbursement benefits
  • Hybrid work schedule for most roles, giving employees the flexibility to work from home and from one of our primary offices

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Number of Employees: 1,001-5,000
