About The Position

We are looking for a US-based Senior Director to be the strategic, operational, execution, and escalation owner for all site, infrastructure, and cloud platform services. This role is personally accountable for production reliability and stability, including owning US time-zone incidents and Sev 0/1 events, leading cutovers, and directly representing site, infrastructure, and platforms to executive leadership during high-impact events. The expectation is that this leader stands at the front of the line during critical incidents and events such as migration and stabilization, makes real-time decisions, and clearly articulates risk, impact, and trade-offs to executives under pressure.

This front-line ownership is intentional but transitional. A core measure of success for this role is building the systems, operating model, delegation structure, and strong leadership bench such that sustained, high-quality operations do not depend on the continuous personal presence of a single leader. The leader is expected to design for leverage: establishing clear ownership, developing managers/leaders, and embedding practices that scale reliability beyond individual heroics.

In parallel, they are expected to lead the full lifecycle of our infrastructure transformation, from data center exit and AWS migration through steady-state cloud operations and platform maturity. Success is measured not just by completing the migration, but by leaving behind a durable operating model with clear delegation, on-call ownership, and predictable executive engagement. The ideal candidate will have personally led large-scale data center exits and Cloud migrations, not just advised or governed them.

Requirements

  • 15+ years of experience in infrastructure, Cloud, SRE, or platform engineering.
  • 7+ years leading large engineering organizations (managers of managers or equivalent).
  • Direct, hands-on leadership of at least one full data center exit and AWS migration, including decommissioning of on-premise infrastructure.
  • Deep technical expertise in AWS, including VPC networking, EC2, EKS/Kubernetes, RDS/Aurora, S3, IAM, and observability tooling.
  • Strong experience operating highly available, distributed systems using SRE principles.
  • Proven ability to lead complex, high-risk infrastructure transformations in production environments.
  • Expertise in FinOps and cloud cost optimization practices.
  • Demonstrated ability to drive standards and adoption across distributed engineering teams without relying on reporting lines.
  • Skilled at operating as a front-line executive leader during critical situations, including migrations, upgrades, DR, incidents, and major production events.

Responsibilities

  • Lead Data Center Exit & AWS Migration
  • Establish and own the post-migration operating model for cloud infrastructure and platforms, explicitly tied to outcomes:
  • Hold teams accountable for post-migration reliability metrics, including:
  • Lead physical and logical data center decommissioning only after post-migration SLOs are consistently met and incident KPIs have stabilized.
  • Build and Evolve the Cloud Platform
  • Own the vision, roadmap, and execution for the company’s cloud platform, ensuring it supports both migration needs and long-term, steady-state operations on AWS.
  • Deliver self-service, opinionated platform services that improve developer productivity while meeting security and reliability standards.
  • Modernize legacy and architect for Multi-Tenant SaaS:
  • Partner closely with application and product engineering to ensure the platform accelerates delivery while maintaining reliability and compliance.
  • Incident Management, SRE & Operational Resilience Leadership
  • Own and evolve the end-to-end incident management lifecycle for infrastructure and platform services, grounded in SRE principles of reliability, learning, and automation.
  • Define and enforce SLIs, SLOs, and error budgets for platform and infrastructure services, using them to guide operational decisions, release risk, and incident response.
  • Lead the transition from incident response as heroics to incident prevention by design, embedding reliability, AI, capacity planning, and failure-mode analysis into platform roadmaps and change processes.
  • Serve as the executive escalation owner for Sev 0 and Sev 1 incidents, personally leading response, trade-off decisions, and executive communications when required, while delegating incident command to empowered leaders to ensure sustained coverage.
  • Hold clear decision authority under pressure, including the ability to unilaterally halt or roll back changes, trigger failovers, traffic shifts, and disaster recovery actions, reallocate engineering resources in demanding situations, and make go/no-go cutover decisions to protect customers and data, escalating to executive leadership when actions materially impact regulatory posture, contractual commitments, or significant financial exposure.
  • Build and maintain a US-based SRE and incident leadership bench, with multiple leaders capable of acting as Incident Commander, owning executive updates, and coordinating cross-functional response.
  • Lead through error budgets and reliability signals to drive blameless postmortems, root-cause analysis, and prioritization of systemic fixes over short-term feature velocity.
  • Own the systematic reduction of operational toil and capacity tax across infrastructure and platform teams, with clear accountability for ensuring reactive work declines as systems mature.
  • Hold teams accountable to measurable toil and resilience KPIs, such as percentage of engineer time spent on reactive work, on-call interrupt frequency, manual intervention rates, and incident recurrence.
  • Influence readiness through game days, chaos testing, and migration-specific drills, validating both technical resilience and delegation models under pressure.
  • Ensure incident management tooling, observability (metrics, logs, traces), and documentation are standardized, well-owned, and continuously improved.
  • Program, Stakeholder, and Executive Leadership
  • Partner with product, engineering, security, enterprise architecture, and finance to shape cloud migration and platform decisions that directly impact cost-to-serve, unit economics, and operational overhead, ensuring infrastructure choices scale sustainably with business growth.
  • Drive architectural and platform standards that reduce total cost of ownership, including infrastructure spend, support burden, reliability overhead, and on-call load.
  • Embed FinOps and Reliability signals (utilization, reliability cost, incident-driven spend, operational toil) into platform roadmaps and migration sequencing, making trade-offs explicit between performance, resilience, speed, and cost.
  • Translate infrastructure and platform choices into clear business outcomes such as per-customer cost, per-transaction cost, and support effort, enabling executives to make informed investment and prioritization decisions.
  • Act as a trusted advisor on infrastructure and cloud strategy, challenging assumptions and translating complex technical risks into clear business impact, options, and trade-offs to enable informed decision-making under pressure.
  • Build and delegate clear ownership and accountability for cloud migration timelines, risks, and outcomes.
  • Establish clear governance, readiness reviews, and success metrics for migration and platform initiatives.
  • Partner with and guide steering committees, technical working groups, and cross-organizational readiness forums.
  • People and Organization Leadership
  • Own the design, scale, and effectiveness of the Cloud Platform Engineering organization, including SRE, cloud infrastructure, and platform engineering teams across geographies.
  • Build and lead a strong leadership bench, developing senior managers, principal engineers, and architects who can operate independently at scale.
  • Clearly define delegation, decision rights, and escalation paths so that critical incidents, migrations, and operational responsibilities are owned at the right level.
  • Drive organizational clarity across charters, roles, responsibilities, and decision rights to reduce friction and increase delivery velocity.
  • Actively recruit, retain, and develop top-tier infrastructure, SRE, and platform talent, including succession planning for critical roles.
  • Establish a culture of engineering excellence, reliability, and continuous improvement, grounded in data, post-incident learning, and blameless accountability.
  • Lead change management during periods of transformation, including data center exit, cloud migration, and operating model shifts.
  • Foster strong partnerships with product, application engineering, security, and business leaders, ensuring platform teams are seen as strategic enablers rather than service providers.
  • Champion diversity of thought, inclusive leadership, and high team engagement across a growing, global organization.

Benefits

  • Our Utah office features onsite perks such as company-paid meals, massage therapists, a sports simulator, a gym, a mother’s lounge, and a meditation room, along with meaningful interactions with amazing people.

What This Job Offers

Job Type

Full-time

Career Level

Director

Education Level

No Education Listed

Number of Employees

501-1,000 employees
