Sr. Infrastructure Site Reliability Engineer

Charles Schwab Inc.•Southlake, TX

1d•$139,000 - $161,000

About The Position

Schwab Technology Services enables the future of how clients manage their money by providing innovative and reliable technology products and services as part of our ongoing commitment to democratize access to investing and financial planning.

A Manager for Advisor Services Technology (AST) Infrastructure Operations SRE will lead the strategy, execution, and operational excellence of the application infrastructure ecosystem supporting AST platforms. This role is accountable for ensuring high availability, scalability, reliability, and performance through disciplined operational practices, life cycle management, and modern SRE principles. This requires oversight of all routine and strategic infrastructure initiatives, including operating system upgrades, patching, EOL remediation, infrastructure changes, middleware and database activities, cloud technologies and readiness, tooling modernization, and automation at scale.

You will drive holistic capacity management, ensuring that compute, storage, network, and application-tier resources are designed and maintained to meet current and future business demand. You will partner closely with architecture and application engineering teams to ensure infrastructure and platform components align with solution designs and support the long-term technical roadmap. The role also governs the organization's observability platforms, defining the telemetry strategy, metrics, SLOs, and alerting posture necessary to maintain operational health and reduce toil. You will lead ongoing improvements in automation, resilience engineering, disaster recovery readiness, and operational maturity, creating repeatable, well-engineered processes that support rapid change with minimal risk.

This role requires a deep understanding of enterprise infrastructure and security principles, excellent analytical skills, and the ability to communicate effectively with technical and non-technical stakeholders.

Requirements

Master’s degree in Computer Science, Master of Science, Information Technology Management, Management Information Systems, or a related field.
10+ years of experience in Site Reliability Engineering and Production Operations.
Deep knowledge of application hosting patterns: distributed systems, microservices, message queues, caching, API gateways.
Expertise in managing infrastructure (VMware, Linux, Windows Server, SAN/NAS, load balancers, containers such as PCF) and configuration management.
Knowledge of cloud platforms (GCP, AWS, Azure) and cloud-native SRE practices.
Proven experience with automation and scripting for observability metrics and productivity enhancements, using scripting languages and tooling such as Python, PowerShell, Bash, Ansible, SaltStack, Chef, and Terraform.
Strong working experience with observability platforms (Splunk, Grafana, AppDynamics, ITRS, Dynatrace, etc.).
Familiarity with secure coding practices and software development methodologies.
Excellent analytical and problem-solving skills to identify, assess, and prioritize production outages and resolve them effectively.
Strong understanding of service-level objectives (SLOs), error budgets, resilience patterns, and failure-mode analysis.
Solid working knowledge of Schwab resiliency policy, with the ability to design high-availability and disaster recovery architectures.
Experience in security compliance and threat remediation.
Hands-on capacity management experience, including analyzing and forecasting resource utilization.

Nice To Haves

Google Cloud Certification - Associate Cloud Engineer.
Experience with software development and CI/CD pipelines (Bitbucket, GitHub) is beneficial.
Familiarity with security standards and frameworks.
Knowledge of Veracode and Qualys scans, Chef InSpec, certificate management, and vulnerability remediation.
Knowledge of database platforms: Oracle DB, Microsoft SQL Server, PostgreSQL, MongoDB.
Understanding of networking tools such as Wireshark, Nmap, tcpdump, Nagios, and JMeter.

Responsibilities

Leads the strategy, execution, and operational excellence of the application infrastructure ecosystem supporting AST platforms.
Ensures high availability, scalability, reliability, and performance through disciplined operational practices, life cycle management, and modern SRE principles.
Oversees routine and strategic infrastructure initiatives, including operating system upgrades, patching, EOL remediation, infrastructure changes, middleware and database activities, cloud technologies and readiness, tooling modernization, and automation at scale.
Drives holistic capacity management, ensuring compute, storage, network, and application-tier resources are designed and maintained to meet current and future business demand.
Partners closely with architecture and application engineering teams to ensure infrastructure and platform components align with solution designs and support the long-term technical roadmap.
Governs the organization's observability platforms, defining the telemetry strategy, metrics, SLOs, and alerting posture.
Leads ongoing improvements in automation, resilience engineering, disaster recovery readiness, and operational maturity.
Practices a Site Reliability Engineering mindset and solves problems through automation and instrumentation.
Identifies opportunities to build innovative tools and solve unique operations problems on large enterprise and mission-critical applications.
Drives continuous improvement via automation across infrastructure provisioning, configuration management, compliance, system health, and operational activities.
Monitors the current state of infrastructure to identify deficiencies caused by aging technologies or misalignment with business requirements.
Analyzes the business-IT environment to detect critical deficiencies and recommends solutions for improvement.
Governs change management practice, ensuring minimal service impact of infrastructure changes and activities.
Leads capacity planning across compute, storage, and application tiers to ensure scalability and optimization.
Implements proactive monitoring and forecasting to prevent performance degradation across all supported platforms (on-prem and cloud technologies).
Partners with architecture teams to improve system resiliency, failover design, and scalability patterns.
Establishes standards for tooling around runbooks, incident response, and environment configuration.
Leads complex incident triage and root-cause analysis, driving action plans to eliminate recurrences.
Coordinates DR exercises, ensuring process and documentation accuracy and cross-team alignment.
Oversees cybersecurity risk, threat, and vulnerability programs.