About The Position

We are seeking a skilled and experienced Problem Manager to join our enterprise Site Operations organization. In this role, you will own the end-to-end problem management lifecycle for our SaaS production environment, lead blameless root-cause investigations, and drive systemic engineering and operational improvements that increase platform availability, reliability, and customer satisfaction. You will partner closely with SRE, Engineering, Product, Customer Support, and Cloud Infrastructure teams to convert incident learnings into durable fixes and measurable reliability gains.

Requirements

  • 5+ years of experience in SaaS operations, SRE, incident response, or problem management in enterprise environments.
  • Bachelor’s degree in Computer Science, Information Systems, Engineering, or equivalent experience.
  • Demonstrated experience leading RCAs and driving cross-functional remediation in cloud-native systems (AWS, Azure, or GCP).
  • Strong working knowledge of distributed systems concepts, microservices, containers (Kubernetes/Docker), CI/CD, and Infrastructure as Code.
  • Proficiency with observability and incident tooling (e.g., Datadog, Prometheus, Splunk, New Relic, or equivalent) and with ITSM and reporting platforms (e.g., ServiceNow, JSM, Power BI, Appfire).
  • Proven ability to influence engineering teams and stakeholders without direct authority and to manage competing priorities.

Nice To Haves

  • ITIL v4 certification or equivalent experience implementing ITIL-aligned practices.
  • Experience in multi-region, high-availability SaaS deployments and with formal SLO/SLA/error budget management.
  • Familiarity with chaos engineering, capacity planning, and reliability engineering practices.
  • Experience working in regulated industries or environments with strict compliance/audit requirements.

Responsibilities

  • Own and govern the problem management process: identification, triage, prioritization, remediation tracking, and closure.
  • Facilitate blameless post-incident reviews and structured RCA sessions (e.g., 5 Whys, Fishbone).
  • Produce and maintain high-quality postmortems and remediation plans; ensure timely execution and verification of corrective actions.
  • Translate operational failures into prioritized engineering work and track closure through to verification.
  • Monitor and analyze incident trends and recurring failure modes; recommend and coordinate systemic mitigations.
  • Align problem management with SLAs/SLOs, error budget practices, and availability targets.
  • Drive cross-functional accountability and escalate material reliability risks to leadership with clear impact analysis.
  • Partner with Observability, Release Engineering, and Security teams to close monitoring, testing, and dependency gaps.
  • Define and track problem-management metrics and reports for leadership (e.g., problem aging, action completion rate, recurring incident rate).
  • Maintain compliance with governance and change-control requirements applicable to enterprise SaaS operations.