Platform Engineer Lead - Disaster Recovery and Resiliency

Capital Group
Irvine, CA
$179,273 - $286,837

About The Position

As a Platform Engineer Lead - Disaster Recovery and Resiliency, you will be responsible for the operational side of disaster recovery and resilience. You will develop, implement, and maintain a resiliency framework and capabilities that application teams can consume via automated product offerings or repeatable patterns to attest to the validity and viability of their disaster recovery plans in line with business outcomes. You will partner with infrastructure and application teams to design and implement scripts, templates, and workflows that automate their product’s disaster recovery. This includes automation for all relevant resiliency elements: disaster recovery provisioning and scaling, configuration management, monitoring and observability, resyncing and reconciliation, and testing.

You will partner closely with project managers, technical leads, and business stakeholders to identify testing scenarios for potential threats, assess impacts, and design testing solutions that ensure business continuity and minimize risk. You will perform detailed evaluations of platform and application resiliency readiness to identify areas of concern. You will conduct regular testing, monitoring, and reporting of resiliency and disaster recovery plans and activities. You will develop the capability to capture the book of record for all disaster recovery-related data. You will identify gaps and continuous improvement opportunities.

You will design and implement data-collection scripts, implement and maintain monitoring tools, and develop front-end dashboards to monitor the health, performance, and utilization of Capital’s recovery environment, enabling a prompt response when warning signs appear. You will support Global Risk and their requirements to report to regulators on our disaster recovery effort.

Requirements

  • 7+ years of hands‑on experience in resiliency, disaster recovery, or business continuity for midsize to large enterprises, with proven technical leadership delivering enterprise‑scale DR and resiliency solutions, preferably in regulated or financial services environments.
  • Bachelor's degree in computer science, information systems, engineering, or a related field.
  • Strong AWS platform engineering expertise, including hands-on experience with AWS Resilience Hub and AWS Fault Injection Service, and the ability to design, implement, validate, and operationalize AWS-first resiliency and recovery strategies across cloud, hybrid, and on-premises environments.
  • Demonstrated ability to design and evolve enterprise disaster recovery and resiliency frameworks, delivering repeatable, automated patterns and scalable solutions consumable by application teams.
  • Deep experience with Infrastructure as Code (IaC) and automation-first delivery, using tools such as Terraform, Ansible, Chef, or Puppet, along with strong knowledge of CI/CD principles, resiliency frameworks, DR testing strategies, chaos engineering, and risk-based analysis.
  • Proven technical leadership and communication skills, with the ability to lead cross‑functional delivery, influence without authority, conduct DR and resiliency testing, support regulatory and risk reporting, and clearly communicate outcomes to both technical and non‑technical stakeholders.

Nice To Haves

  • Coding experience (e.g., Python, JavaScript) and relevant certifications (CBCP, CRISC, CISA) are a plus.

Responsibilities

  • Developing, implementing, and maintaining a resiliency framework and capabilities
  • Partnering with infrastructure and application teams to design and implement scripts, templates, and workflows that automate their product’s disaster recovery
  • Partnering closely with project managers, technical leads, and business stakeholders to identify testing scenarios for potential threats, assess impacts, and design testing solutions
  • Performing detailed evaluations of platform and application resiliency readiness to identify areas of concern
  • Conducting regular testing, monitoring, and reporting of resiliency and disaster recovery plans and activities
  • Developing the capability to capture the book of record for all disaster recovery-related data
  • Identifying gaps and continuous improvement opportunities
  • Designing and implementing data-collection scripts, implementing and maintaining monitoring tools, and developing front-end dashboards to monitor the health, performance, and utilization of Capital’s recovery environment
  • Supporting Global Risk and their requirements to report to regulators on our disaster recovery effort

Benefits

  • Competitive salary
  • Bonuses and benefits
  • Company-funded retirement contribution
  • Generous time-away and health benefits from day one, with the opportunity for flexible work options
  • 2-for-1 matching gifts for your charitable contributions and the opportunity to secure annual grants for the organizations you love
  • On-demand professional development resources