Vice President, Head of Infrastructure Resiliency

AssetMark
Charlotte, NC
Hybrid

About The Position

As the Head of Platform Resiliency & Operations, you are accountable for operating and engineering the reliability, scalability, and resilience of AssetMark’s platform. This role owns production operations today, including environments, batch processing, incident response, and day-to-day platform management, which are currently operationally intensive. Your mandate is to transform this reality by driving an engineering-first approach to production management and infrastructure. You will lead a fundamental shift from reactive, manual operations to proactive, automated, and engineered reliability, while continuing to deliver a high-quality, always-on platform for our clients.

This role has a twofold mandate:

  • Deliver on our client commitment by operating a high-availability, high-resiliency platform where reliability is a defining feature of the product
  • Enable high-velocity product development by building systems, tooling, and practices that allow Product & Engineering to move fast without compromising stability

We can only consider candidates for this position who are able to accommodate a hybrid work schedule and are close to our Charlotte, NC office.

Requirements

  • Strong background in Software Engineering or Systems Engineering; you lead reliability through code, not process alone
  • Deep expertise in distributed systems, failure modes, and large-scale platform architecture
  • Passion for observability, SLOs, and data-driven reliability management
  • Experience owning production operations for mission-critical systems
  • Track record of transforming manual, operations-heavy environments into automated, engineering-led platforms
  • Experience building and scaling SRE and/or Platform Engineering capabilities
  • Strong incident leadership experience with a focus on blameless culture and systemic improvement
  • Demonstrated ability to drive behavioral change across Engineering and Infrastructure teams
  • Experience embedding operational rigor into the software development lifecycle (SDLC)
  • Ability to balance reliability with product velocity through data-driven tradeoffs
  • Strong partner to Product, Engineering, and Infrastructure leadership
  • Able to communicate clearly with executives during high-pressure incidents
  • Deep understanding of reliability as a core business capability in financial services
  • Candidates must be legally authorized to work in the US to be considered.
  • We are unable to provide visa sponsorship for this position.

Nice To Haves

  • Ability to accommodate a hybrid work schedule and proximity to our Charlotte, NC office

Responsibilities

  • Own 24/7 production operations for mission-critical systems, including incident management, batch processing, and environment stability
  • Lead the transformation of production operations from manual, reactive processes to automated, engineering-driven systems
  • Establish an engineering-first mandate to eliminate manual toil and operational overhead
  • Drive systematic improvements in reliability, scalability, and operational efficiency
  • Define and operationalize Service Level Indicators (SLIs) and Service Level Objectives (SLOs) across all critical systems
  • Establish and govern Error Budgets to balance product velocity with platform stability
  • Drive measurable reduction in operational toil through automation and engineering solutions
  • Embed reliability targets into planning and decision-making across teams
  • Apply Site Reliability Engineering (SRE) principles to quantify and manage reliability
  • Build full-stack observability (metrics, logs, traces) to improve detection and diagnosis of issues
  • Evolve monitoring into deep observability with actionable alerting and reduced alert fatigue
  • Establish resilience testing practices (e.g., game days, fault injection)
  • Drive automated incident response and self-healing systems
  • Institutionalize blameless post-mortems focused on systemic improvement
  • Leverage SRE practices for incident learning and continuous improvement
  • Ensure all infrastructure is managed via Infrastructure as Code (IaC) for consistency, scalability, and recovery
  • Own reliability and operational integrity of CI/CD pipelines, including automated release gating
  • Build self-service platforms and tooling that enable engineering teams to deploy and operate services safely
  • Modernize batch processing and environment management through automation and engineering rigor
  • Establish shared accountability for reliability between Platform, SRE, and Software Engineering teams
  • Partner with Engineering to co-deliver reliability improvements and conduct joint post-incident reviews
  • Influence engineering practices including production readiness, safe deployments, and observability standards
  • Ensure reliability is embedded early in the software development lifecycle
  • Define and enforce reliability standards for third-party vendors and platform dependencies
  • Establish SLIs/SLOs for external services and manage vendor performance accordingly
  • Map and govern system dependencies to prevent cascading failures

Benefits

  • Flex Time or Paid Time Off and Sick Time Off
  • 401K – 6% Employer Match
  • Medical, Dental, Vision – HDHP or PPO
  • HSA – Employer contribution (HDHP only)
  • Volunteer Time Off
  • Career Development / Recognition
  • Fitness Reimbursement
  • Hybrid Work Schedule
  • Competitive Benefits