Charles Schwab - posted 12 months ago
Full-time • Senior
Southlake, TX
5,001-10,000 employees
Securities, Commodity Contracts, and Other Financial Investments and Related Activities

At Schwab, you're empowered to make an impact on your career. Here, innovative thought meets creative problem solving, helping us "challenge the status quo" and transform the finance industry together. The Sr Manager, Reliability Engineering and Operations is enthusiastic about leading technology teams responsible for delivering exceptional application and production support. You need a proven track record of critical thinking, with a laser focus on pragmatic problem solving, production support, and customer satisfaction. We require strong ethics and the ability to partner with and influence business partners, product teams, and technologists across the organization. The right candidate will have a strong background in leading and developing 24x7 support teams.

  • Leading and mentoring a Production Operations team for Schwab's Workplace Financial Services Technology team, fostering a culture of continuous improvement and innovation
  • Collaborating with cross-functional teams to ensure alignment on reliability and performance goals
  • Being a hands-on technical leader who leads from the front and inspires thought leadership in the team
  • Identifying tactical and strategic opportunities to improve service health, performance, reliability, and telemetry
  • Driving a shift-left mindset and influencing architectural decisions to ensure resiliency and scale at the outset of the software development process
  • Advocating automation so that teams follow patterns that ensure repeatability, consistency, and portability
  • Identifying toil and technical debt, developing a comprehensive plan, and leading the team through its execution
  • Conducting post-mortem reviews to identify areas for improvement and implementing solutions to enhance system reliability
  • Implementing and promoting performance engineering practices to ensure optimal system performance
  • Developing and executing strategies for destructive testing to identify potential points of failure and improve system resilience
  • Working closely with the development team to define a sustainable operating model for mobile applications, focusing on platform scale, availability, fault tolerance, and performance
  • Leading the team with a data-driven mindset, focusing on key performance metrics such as MTTD, MTTR, and availability, in close collaboration with development teams
  • Overseeing production engineering efforts to ensure systems are designed for operational excellence and reliability
  • Providing technical guidance as needed during incidents and daily work
  • Providing leadership around incident management and root cause analysis to resolve production issues and prevent recurrence
  • Establishing and maintaining operational support practices, including monitoring, alerting, and incident response
  • Leading the team in their SRE maturity journey
  • Driving continuous improvement initiatives in reliability, performance, automation, and operational support
  • Staying current with industry trends and best practices to ensure our systems and processes remain in line with SRE tenets
  • 10+ years of experience running and managing 24/7/365 application support teams responsible for enterprise applications, infrastructure, and systems
  • 10+ years of experience in measuring, tracking, improving, and reporting on SLOs, SLAs, and KPIs
  • 7+ years of experience supporting enterprise applications in production
  • 5+ years of experience working with enterprise ITSM business processes
  • ITIL experience with enterprise systems, including but not limited to event and incident management, release and deployment, and enterprise change management
  • Available for after-hours calls/incident management
  • Experience managing multi-shift teams
  • Recent experience leading an operations organization that focuses on event and incident management
  • Experience in monitoring tools with a focus on ITIL capabilities
  • Experience with GitHub, Bamboo, Bitbucket, Splunk, ThousandEyes, and AppDynamics