Director, SRE (PL)

Charles Schwab • Austin, TX

5d • Onsite

About The Position

Your Opportunity

We believe in the importance of in-office collaboration and fully intend for the selected candidate for this role to work on site in the specified location(s).

At Schwab, you're empowered to make an impact on your career. Here, innovative thought meets creative problem solving, helping us “challenge the status quo” and transform the finance industry together.

We are seeking an experienced SRE Director to lead and scale our Site Reliability Engineering organization. This role requires a proven technology leader who can drive the adoption of advanced tools and methodologies, foster a culture of continuous improvement, and ensure our systems are resilient, secure, and scalable. You will be instrumental in guiding teams through complex AI Ops transformations while empowering them to embrace new technologies and build a high-performance engineering culture.

This is not a traditional operations role. We're looking for a leader who embraces the SRE philosophy: treating operations as a software engineering problem, eliminating toil through automation, and using data-driven approaches to balance reliability with velocity. You'll lead the transformation from reactive operations to proactive engineering, where reliability is designed in, not bolted on.

Requirements

10+ years of experience in software engineering, infrastructure, or site reliability roles.
5+ years of people leadership experience managing engineering teams and managers.
Strong software engineering background with proficiency in programming languages (Python, Go, Java, etc.)—this is not an operations-only role.
Deep expertise in cloud platforms (AWS, Azure, GCP) and distributed systems architecture.
Strong background in automation, CI/CD, infrastructure as code, and configuration management.
Proven track record of driving large-scale technical and operational transformations, including AI Ops adoption.
Experience implementing SLO/SLI frameworks and error budget policies.
Experience with observability tools, monitoring platforms, and incident management systems.
Strong understanding of security best practices, compliance requirements, and risk management.
Excellent communication skills with the ability to influence stakeholders at all levels.
Ability to articulate the business value of reliability engineering and the ROI of automation investments.

Nice To Haves

Experience with AI/ML operations, AIOps platforms, and intelligent automation.
Background in chaos engineering, game days, and resilience testing.
Knowledge of modern SRE tools and practices (Kubernetes, Terraform, Datadog, Grafana, etc.).
Experience leading the cultural transformation from traditional IT operations to SRE.

Responsibilities

Lead, mentor, and scale a high-performing team of SRE engineers and managers.
Define and execute the strategic vision for site reliability, availability, and performance across the organization.
Drive the adoption of advanced SRE practices, automation frameworks, and AI-powered operational tools.
Foster a culture of continuous improvement and blameless learning through postmortems—turning failures into opportunities for growth.
Partner with Engineering, Product, and Security teams to align SRE initiatives with business objectives.
Transform the traditional operations mindset into an SRE culture: shifting from reactive firefighting to proactive system design, and from manual processes to software-driven automation.
Ensure systems are resilient, secure, and scalable to meet current and future business demands.
Lead transformation initiatives leveraging AI Ops and intelligent automation to enhance operational efficiency.
Establish and maintain SLIs, SLOs, and error budgets to drive reliability commitments and enable data-driven discussions about acceptable risk.
Lead automation initiatives to eliminate toil and scale operational efficiency—prioritizing code-driven solutions over manual processes.
Drive incident management excellence, including root cause analysis, postmortem culture, and continuous learning.
Oversee capacity planning, performance optimization, and infrastructure cost management.
Apply software engineering principles to operations: version control, code review, testing, and CI/CD for all infrastructure and tooling.
Foster collaboration between development and operations teams through SRE principles—breaking down silos and embedding reliability into the development process.