Vice President, Head of Infrastructure Resiliency

AssetMark • Charlotte, NC

Hybrid

About The Position

As the Head of Platform Resiliency & Operations, you are accountable for operating and engineering the reliability, scalability, and resilience of AssetMark’s platform. This role owns production operations today (including environments, batch processing, incident response, and day-to-day platform management), which are currently operationally intensive. Your mandate is to transform this reality by driving an engineering-first approach to production management and infrastructure. You will lead a fundamental shift from reactive, manual operations to proactive, automated, and engineered reliability, while continuing to deliver a high-quality, always-on platform for our clients.

This role has a twofold mandate:

Deliver on our client commitment by operating a high-availability, high-resiliency platform where reliability is a defining feature of the product.
Enable high-velocity product development by building systems, tooling, and practices that allow Product & Engineering to move fast without compromising stability.

We can only consider candidates for this position who are able to accommodate a hybrid work schedule and are close to our Charlotte, NC office.

Requirements

Strong background in Software Engineering or Systems Engineering; you lead reliability through code, not process alone
Deep expertise in distributed systems, failure modes, and large-scale platform architecture
Passion for observability, SLOs, and data-driven reliability management
Experience owning production operations for mission-critical systems
Track record of transforming manual, operations-heavy environments into automated, engineering-led platforms
Experience building and scaling SRE and/or Platform Engineering capabilities
Strong incident leadership experience with a focus on blameless culture and systemic improvement
Demonstrated ability to drive behavioral change across Engineering and Infrastructure teams
Experience embedding operational rigor into the software development lifecycle (SDLC)
Ability to balance reliability with product velocity through data-driven tradeoffs
Strong partner to Product, Engineering, and Infrastructure leadership
Able to communicate clearly with executives during high-pressure incidents
Deep understanding of reliability as a core business capability in financial services
Candidates must be legally authorized to work in the US to be considered.
We are unable to provide visa sponsorship for this position.

Nice To Haves

Able to accommodate a hybrid work schedule and located close to our Charlotte, NC office.

Responsibilities

Own 24/7 production operations for mission-critical systems, including incident management, batch processing, and environment stability
Lead the transformation of production operations from manual, reactive processes to automated, engineering-driven systems
Establish an engineering-first mandate to eliminate manual toil and operational overhead
Drive systematic improvements in reliability, scalability, and operational efficiency
Define and operationalize Service Level Indicators (SLIs) and Service Level Objectives (SLOs) across all critical systems
Establish and govern Error Budgets to balance product velocity with platform stability
Drive measurable reduction in operational toil through automation and engineering solutions
Embed reliability targets into planning and decision-making across teams
Apply Site Reliability Engineering (SRE) principles to quantify and manage reliability
Build full-stack observability (metrics, logs, traces) to improve detection and diagnosis of issues
Evolve monitoring into deep observability with actionable alerting and reduced alert fatigue
Establish resilience testing practices (e.g., game days, fault injection)
Drive automated incident response and self-healing systems
Institutionalize blameless post-mortems focused on systemic improvement
Leverage SRE practices for incident learning and continuous improvement
Ensure all infrastructure is managed via Infrastructure as Code (IaC) for consistency, scalability, and recovery
Own reliability and operational integrity of CI/CD pipelines, including automated release gating
Build self-service platforms and tooling that enable engineering teams to deploy and operate services safely
Modernize batch processing and environment management through automation and engineering rigor
Establish shared accountability for reliability between Platform, SRE, and Software Engineering teams
Partner with Engineering to co-deliver reliability improvements and conduct joint post-incident reviews
Influence engineering practices including production readiness, safe deployments, and observability standards
Ensure reliability is embedded early in the software development lifecycle
Define and enforce reliability standards for third-party vendors and platform dependencies
Establish SLIs/SLOs for external services and manage vendor performance accordingly
Map and govern system dependencies to prevent cascading failures