About The Position

Senior Director, Reliability and Resiliency Oversight

Capital One is one of the fastest growing organizations in the world today, powered by our passion for our customers. We are serious about technology, we dream big, and we execute: Capital One moved our entire enterprise to the public cloud over the course of five years. Just as we prioritize driving innovation through technology, we equally prioritize cybersecurity, reliability, software quality, and data management.

Technology & Data Risk Management (TDRM) is a small organization that packs a big punch. The ~200 professionals in TDRM are trusted experts who oversee ~14,000 developers at Capital One. We raise the bar for excellence in cybersecurity, reliability, tech risk, and data management risk. We shape strategy and decisions, challenge activities to ensure they meet our standards, and perform independent tests of our security and technology risk.

For years, the cybersecurity community has debated whether the CISO should report to the CIO or not. In regulated financial services, the answer is: both. The first-line CISO has operational responsibilities and reports to the CIO. The second-line Chief Tech Risk Officer (CTRO) and the TDRM organization have broader responsibilities that cover not only cybersecurity but also reliability, software quality, resilience, and the risk of failing to manage our data. The CTRO is independent and oversees the work of the CISO, the CIO/CTO, and the Chief Data Officer. The CTRO reports to the Chief Risk Officer, who reports directly to the CEO.

Our business leaders must constantly make technology decisions. TDRM makes sure they have the tech and data risk information they need to make good decisions. Associates within TDRM are highly skilled information security, cybersecurity, site reliability engineering, technology, data analyst, data scientist, and risk management professionals. They have a wealth of experience and a demonstrated ability to add value with their advice and to deliver high-impact results.

For the Senior Director, Resiliency and Reliability Oversight role, we are looking for a hands-on technical leader focused on enterprise-scale systems architecture and operational stability. This is a pivotal and high-impact role responsible for shaping the strategic vision and execution of second-line advisory and oversight across multiple functions. You will directly influence how thousands of mission-critical systems are designed, deployed, and managed to ensure maximum uptime and durability. Technical credibility, architectural judgment, and the ability to influence engineering culture are critical to success.

What You’ll Do

Strategic Leadership

  • Represent the second-line risk management function in architecture councils and Site Reliability Engineering (SRE) forums to ensure a rigorous "design-for-failure" lens is applied to major cloud initiatives and multi-region strategies.
  • Cultivate a culture of reliability by staying at the forefront of reliability engineering, automated recovery patterns, and distributed system architecture; mentor risk and engineering teams on balancing feature velocity with system stability.
  • Lead a high-impact team of technical resiliency advisors focused on advising and challenging the first line’s recoverability and durability of mission-critical cloud ecosystems.

Enterprise Influence

  • Partner broadly across the enterprise to identify and assess evolving risks to system availability in a fast-moving environment. You will advise on architecture decisions regarding blast-radius containment, failover strategies, and disaster recovery capabilities.
  • Build and maintain deep relationships with technical leaders, architects, and engineers, ensuring that availability risks are transparent and well understood by key stakeholders.

Executive Communication

  • Draft and communicate independent reports to inform a broad audience including engineers, executives, the Board of Directors, and regulators on the organization’s current reliability posture and risk environment.

Requirements

  • Bachelor’s Degree or military experience
  • At least 5 years of experience in Site Reliability Engineering (SRE), Disaster Recovery, or high-availability architecture
  • At least 10 years of experience in infrastructure operations, software development, or systems architecture

Nice To Haves

  • At least 8 years of direct hands-on experience designing and governing distributed systems or large-scale enterprise infrastructure.
  • At least 5 years of people leadership experience.
  • Master’s Degree in Computer Science or an Engineering discipline.
  • Experience leading enterprise-wide resiliency transformations (e.g., implementing multi-region failover, automated recovery, or Chaos Engineering practices).
  • Ability to communicate complex technical risks clearly to all levels of the organization and drive consensus across competing priorities.
  • Familiarity with controls and frameworks related to operational risk (e.g., NIST 800-34, ISO 22301, COBIT, or Digital Operational Resilience acts).
  • Prior experience working in financial services or other highly regulated sectors.

Benefits

  • Capital One offers a comprehensive, competitive, and inclusive set of health, financial and other benefits that support your total well-being. Learn more at the Capital One Careers website.
  • This role is also eligible to earn performance-based incentive compensation, which may include cash bonus(es) and/or long-term incentives (LTI).