Posted 6 days ago
Full-time • Mid Level
Onsite • Austin, TX
1,001-5,000 employees

We are seeking a highly motivated and experienced Technical Operations Center (TOC) Lead to manage and mentor our 24/7 TOC team. This role is the linchpin of our live service operations and is critical to maintaining the high availability, performance, and reliability of our global game infrastructure. The ideal candidate is a composed, decisive leader with deep technical expertise in incident, problem, and service request management, and is adept at balancing immediate, high-pressure incident response with strategic, long-term process improvements that optimize operational workflows and service delivery.

  • Lead the daily operations of the 24/7 TOC team, including prioritization and execution of work, emergency response, and ad hoc duties.
  • Serve as the primary Incident Commander during major production outages, owning the incident life cycle from detection and triage to resolution and executive notification.
  • Manage Service Request fulfillment within the TOC, ensuring that internal requests (e.g., service restarts, access grants, environmental data refreshes) are prioritized, documented, and executed efficiently by the team.
  • Champion the Problem Management process by analyzing trends in recurring incidents, driving Root Cause Analysis, and tracking permanent corrective actions to resolution across engineering teams.
  • Develop, maintain, and facilitate operational procedures, escalation matrices, and comprehensive runbooks for all critical game services and infrastructure.
  • Oversee and optimize our monitoring, alerting, and logging platforms (e.g., Datadog, CheckMK) to ensure effective coverage and minimize alert fatigue.
  • Collaborate with SRE, Development, and QA teams to integrate new services into the TOC's operational scope and improve the observability of services.
  • Mentor and train TOC Engineers and Analysts in advanced troubleshooting techniques, cloud infrastructure fundamentals, and effective service management principles.

Qualifications

  • 5+ years of experience in a Technical Operations Center (TOC), Network Operations Center (NOC), Site Reliability Engineering (SRE), or similar operational role.
  • 2+ years of demonstrated leadership or management experience overseeing a 24/7 team.
  • Deep technical understanding and proven application of IT Service Management (ITSM) concepts, including Incident, Request, and Problem Management.
  • Expertise in formal incident management methodologies (e.g., ITIL, SRE Incident Response).
  • Deep technical understanding of cloud environments (AWS, Azure, or GCP), containerization (Kubernetes), and CI/CD pipelines.
  • Exceptional verbal and written communication skills, with the ability to clearly articulate technical issues, impact, and remediation steps to all levels of the business under high pressure.
  • A continuous learner who is committed to a culture of operational maturity, automation, and proactive problem-solving.