About The Position

The Head of Production Operations & Resiliency Services is accountable for the operational excellence, availability, and resilience of all Retirement & Insurance technology platforms serving millions of participants and managing hundreds of billions in assets. This role leads a complex ecosystem where approximately 70% of production operations are delivered through managed services partnerships, requiring exceptional vendor governance, operational discipline, and the ability to build high-performing hybrid operational models.

This leader will strengthen our operational foundation while simultaneously transforming toward Site Reliability Engineering (SRE) practices, balancing the immediate need for enterprise-grade stability with the strategic imperative to automate, instrument, and engineer reliability into our systems at scale. They will also lead operational readiness and production stability for a major core platform transformation while establishing the operational excellence framework that will define Retirement & Insurance technology for the next decade.

Strategic Accountability (What Success Looks Like):

  • Operational Excellence: Own availability, performance, and resilience targets across all Retirement & Insurance production systems. Deliver measurable improvements in MTTR (Mean Time to Resolution), change success rates, and proactive issue detection.
  • Vendor Ecosystem Orchestration: Govern and optimize a complex managed services portfolio ensuring accountability, cost efficiency, and service level achievement. Transform vendor relationships from transactional to strategic partnerships.
  • SRE Transformation: Build the roadmap and capabilities to evolve from reactive TechOps to proactive Site Reliability Engineering practices, introducing observability, automation, error budgets, and engineering culture into operations.
  • Business Continuity & Resilience: Ensure disaster recovery readiness, incident response excellence, and crisis leadership that protects TIAA's reputation and participant trust during high-stakes operational events.
  • Platform Transformation Leadership: Serve as operational anchor for major platform migrations and technology modernization initiatives, ensuring production stability throughout complex transitions.

Requirements

  • 10+ years in large-scale production operations, infrastructure management, or site reliability engineering roles
  • Minimum 7 years in leadership roles managing distributed operations teams and/or complex managed services partnerships
  • Proven track record managing mission-critical systems in highly regulated industries (financial services, healthcare, insurance) with stringent availability and compliance requirements
  • Demonstrated success leading operational stability during major platform migrations, data center transitions, or core system transformations
  • Deep expertise in vendor management including contract negotiation, SLA enforcement, and building accountability frameworks for offshore and third-party providers
  • Hands-on background as infrastructure engineer, systems administrator, or site reliability engineer providing credibility with technical teams
  • Bachelor’s degree in Computer Science, Engineering, Information Systems, or related technical field
  • Availability for 24/7 incident escalation (this is a production accountability role)
  • Expert knowledge of ITIL/ITSM frameworks (Incident Management, Problem Management, Change Management) and modern SRE practices (SLOs, error budgets, observability, toil reduction)
  • Strong understanding of enterprise infrastructure including compute, storage, networking, databases, middleware, and integration platforms (on-premises and cloud)
  • Experience with observability and monitoring tools such as Splunk, AppDynamics, Dynatrace, Datadog, or similar APM platforms
  • Familiarity with cloud operations (AWS, Azure, GCP) and hybrid cloud operational models
  • Knowledge of CI/CD pipelines, automation frameworks (Ansible, Terraform), and DevOps toolchains
  • Understanding of mainframe and legacy system operations alongside modern distributed architectures (experience with mainframe-to-modern migrations a plus)
  • Working knowledge of disaster recovery, business continuity planning, and high-availability architectures
  • Exceptional crisis leadership skills with proven ability to remain calm, decisive, and transparent during high-pressure operational incidents
  • Strong vendor negotiation and influence skills driving outcomes without direct authority over third-party teams
  • Excellent executive communication translating technical operational issues into business impact and risk language
  • Strategic thinking balancing immediate operational stability needs with long-term transformation initiatives
  • Talent development mindset coaching teams through cultural and technical transformations
  • Demonstrated ability to build trust and credibility with business partners, development teams, and executive stakeholders

Nice To Haves

  • Experience in retirement, insurance, or wealth management technology with understanding of recordkeeping, participant transactions, or financial administration systems
  • Experience with AWS, Azure, and/or GCP cloud computing platforms
  • Track record transforming traditional TechOps organizations toward SRE/DevOps culture and practices
  • Familiarity with regulatory requirements affecting financial services technology operations (SOC2, SOX, SEC regulations)
  • Experience with Agile/SAFe methodologies collaborating closely with product and development teams
  • Technical certifications such as AWS Solutions Architect, ITIL Expert, or Google SRE
  • Prior experience at large financial institutions or FinTech companies managing operations at scale
  • Master’s degree
  • Agile Methodology
  • Analytical Skills
  • Automation
  • Cloud Platforms
  • Configuration Management
  • Data Management
  • Infrastructure Deployment
  • Infrastructure Support
  • IT Infrastructure
  • Network Administration/Maintenance
  • Problem Solving
  • Programming
  • Project Management
  • Relationship Management
  • Technology Systems

Responsibilities

  • Production Operations & Service Delivery (40%)
      • Accountable for 24/7/365 production operations across Retirement & Insurance technology platforms including recordkeeping systems, participant portals, financial transaction processing, and business-critical applications
      • Define and enforce Service Level Objectives (SLOs), availability targets, and operational KPIs aligned with business requirements and regulatory obligations
      • Lead production change management processes ensuring disciplined risk assessment, rollback planning, and deployment coordination across development, infrastructure, and vendor teams
      • Oversee capacity planning, performance optimization, and scalability management to support business growth and seasonal demand patterns
      • Drive continuous improvement in operational metrics: uptime, MTTR, change success rates, proactive monitoring coverage, and automation maturity
      • Partner closely with Infrastructure teams on compute, storage, network capacity planning, cloud migrations, and platform optimization initiatives to ensure production environments meet availability and performance targets
      • Collaborate with Cybersecurity teams on security incident response, vulnerability remediation in production systems, security patching strategies, and embedding security controls into operational processes without compromising system availability
  • Managed Services & Vendor Governance (30%)
      • Govern relationships with multiple managed services providers delivering infrastructure, application support, monitoring, and incident management capabilities
      • Enforce vendor SLA compliance, conduct regular performance reviews, and drive accountability through data-driven scorecards and escalation frameworks
      • Optimize managed services portfolio for cost efficiency while maintaining or improving service quality
      • Negotiate contracts, SOWs, and operational models that align vendor incentives with TIAA business outcomes
      • Build internal capabilities in areas where vendor performance gaps exist or strategic control is required
      • Ensure seamless operational integration across internal teams, offshore partners, and third-party service providers
  • Incident Management & Crisis Leadership (15%)
      • Own enterprise incident management framework including severity definitions, escalation paths, communication protocols, and executive reporting
      • Serve as Incident Commander for Severity 1/2 incidents affecting Retirement & Insurance operations, orchestrating cross-functional response teams, and ensuring timely resolution
      • Coordinate joint incident response with Infrastructure and Cybersecurity teams during complex outages, security events, or infrastructure failures requiring integrated troubleshooting and resolution
      • Drive blameless postmortem culture focused on systemic improvement rather than individual fault-finding
      • Establish and maintain disaster recovery and business continuity plans with regular testing and validation
      • Partner with Enterprise Risk, Compliance, and Business Continuity teams to meet regulatory requirements and audit expectations
      • Communicate operational health, risks, and incidents to executive leadership with transparency and appropriate urgency
  • SRE Transformation & Automation (10%)
      • Build the multi-year roadmap to introduce Site Reliability Engineering practices including error budgets, SLO-based decision making, toil reduction, and automation-first culture
      • Partner with development teams to embed reliability requirements earlier in the software lifecycle
      • Implement observability strategy leveraging modern APM, logging, and monitoring platforms to enable proactive issue detection
      • Automate repetitive operational tasks (deployments, monitoring, incident response runbooks) to improve efficiency and reduce human error
      • Develop talent pipeline transitioning traditional TechOps professionals toward SRE/DevOps engineering capabilities
  • Leadership & Stakeholder Management (5%)
      • Lead and develop a high-performing operations team blending internal staff and managed services partners
      • Serve as primary operational liaison to Infrastructure leadership ensuring alignment on platform roadmaps, capacity investments, cloud strategy, and shared operational standards
      • Work in lockstep with Cybersecurity leadership to balance security requirements with operational stability, jointly own security incident response procedures, and integrate security automation into production operations
      • Collaborate with Technology Leaders, Product Owners, Architecture, and Business stakeholders to align operational priorities with business objectives
      • Communicate complex operational issues and trade-offs to non-technical executives in clear, business-oriented language
      • Foster culture of operational excellence, accountability, continuous learning, and psychological safety
      • Represent Retirement & Insurance Operations in enterprise forums, steering committees, and strategic planning sessions

Benefits

  • The organization is committed to making financial well-being possible for its clients, and is equally committed to the well-being of our associates. That’s why we offer a comprehensive Total Rewards package designed to make a positive difference in the lives of our associates and their loved ones. Our benefits include a superior retirement program and highly competitive health, wellness and work life offerings that can help you achieve and maintain your best possible physical, emotional and financial well-being.
  • To learn more about your benefits, please review our Benefits Summary

What This Job Offers

Job Type

Full-time

Career Level

Director

Number of Employees

5,001-10,000 employees
