Problem Manager

Leidos
Remote

About The Position

Leidos has an opening for a Problem Manager supporting a large healthcare contract in Baltimore. This role is currently telework / remote and supports the Centers for Medicare & Medicaid Services (CMS) mission. The Problem Manager leads structured problem investigations, coordinates root cause analysis efforts across technical teams, and drives corrective and preventive actions that improve service stability, reduce repeat incidents, and strengthen operational performance.

This position is best suited for a practitioner who is comfortable working across incident management, problem management, change management, and service operations in a complex enterprise environment. The role requires sound judgment, clear writing, effective facilitation, and enough technical depth to work credibly with infrastructure and application teams.

What You Will Do

  • Lead end-to-end problem management activities for high-impact or recurring service issues.
  • Coordinate and facilitate root cause analysis efforts using evidence from incident records, monitoring tools, bridge notes, logs, vendor inputs, and stakeholder interviews.
  • Build clear timelines, identify contributing factors, document a single confirmed root cause when supported by evidence, and track open unknowns when evidence is incomplete.
  • Produce high-quality post-incident and problem analysis reports for technical, operational, and leadership audiences.
  • Drive corrective actions, preventive actions, lessons learned, and follow-up improvement work to closure.
  • Partner with Incident Management, Change Management, Operations, Engineering, and service owners to reduce repeat failures and improve resilience.
  • Monitor trends across incidents and problems to identify recurring patterns, systemic weaknesses, and opportunities for service improvement.
  • Support service review discussions and help ensure performance, availability, and operational commitments are being met.
  • Contribute operational insight to change planning, service improvement initiatives, and related program efforts as needed.

AI and Analysis Expectations

This role is expected to use approved AI tools responsibly to improve the speed and consistency of analysis, documentation, and reporting. The Problem Manager is not expected to build AI systems, but should understand how to use AI as a disciplined support capability within operational guardrails. That includes:

  • Using AI to organize large volumes of incident evidence, reconstruct timelines, summarize known facts, and draft RCA content.
  • Preserving uncertainty and clearly separating confirmed facts from assumptions or open questions.
  • Validating AI-generated content against source evidence before use.
  • Following governance, privacy, documentation, and auditability requirements when using AI in operational workflows.
  • Using prompt structure, reference context, and persona-based guidance to improve output quality in approved environments.

Basic SRE / Reliability Expectations

This role does not require expert-level SRE knowledge, but it does require practical familiarity with core reliability concepts so the Problem Manager can work effectively with engineering and operations teams. Candidates should be comfortable with:

  • Incident lifecycle and escalation coordination.
  • Observability inputs such as alerts, logs, dashboards, and monitoring trends.
  • Service reliability concepts such as availability, performance, resiliency, and operational risk.
  • Understanding the relationship between incidents, known errors, changes, and recurring failure patterns.
  • Discussing mitigation, recovery, rollback, and prevention actions with technical teams.
  • Distinguishing evidence from inference during live incidents and post-incident review.

Requirements

  • Bachelor’s degree with 8–12 years of relevant experience, or Master’s degree with 6–10 years of relevant experience. Relevant experience may be considered in lieu of degree.
  • Experience in Problem Management, Major Incident Management, Service Operations, or related ITSM functions in a complex enterprise environment.
  • Demonstrated experience facilitating root cause analysis and driving corrective actions across multiple teams.
  • Strong written and verbal communication skills, including the ability to present technical issues clearly to leadership and stakeholders.
  • Strong collaboration, coordination, and conflict management skills.
  • Ability to work independently in a dynamic environment while maintaining strong follow-through and accountability.
  • Practical technical familiarity with enterprise infrastructure and operations, with working knowledge in several of the following areas: Windows, Linux, UNIX, networking, firewalls, middleware, storage, mainframe, cloud operations, or data center operations.
  • Ability to obtain and maintain a Public Trust clearance.
  • All candidates supporting CMS programs must have lived in the United States at least three (3) of the last five (5) years to be considered.

Nice To Haves

  • Experience supporting federal or healthcare environments.
  • Working knowledge of ITIL / ITSM practices, especially Incident, Problem, Change, and Knowledge Management.
  • Experience using approved AI platforms to support analysis, drafting, summarization, and structured investigation workflows.
  • Ability to create effective prompts, use reference material correctly, and review AI outputs for accuracy, completeness, and policy compliance.
  • Familiarity with Agile ways of working.
  • Familiarity with service reliability practices, operational metrics, and continuous improvement methods.

Benefits

  • Employment benefits include competitive compensation, health and wellness programs, income protection, paid leave, and retirement.