About The Position

The Root Cause Engineer (RCA), Mid performs structured root cause analysis for recurring, chronic, or high-impact incidents to identify underlying technical, process, or architectural issues affecting mission-critical federal IT services. This role collects and correlates evidence across logs, traces, metrics, configuration data, and incident records to distinguish underlying causes from symptoms and reconstruct incident timelines. The engineer collaborates with Incident Response, Problem Management, SRE, and engineering teams to document RCA outcomes and define actionable corrective and preventive measures that improve service reliability and reduce recurrence.

Requirements

  • Bachelor’s degree in IT, Computer Science, Business Administration, or related field, or equivalent relevant experience.
  • 4–7 years of experience in IT operations, incident/problem management, reliability engineering, or related roles with significant responsibility for conducting structured RCAs.
  • Strong understanding of ITIL principles, incident and problem management best practices, and proficiency with incident and problem management tools.
  • Demonstrated expertise in at least one structured RCA methodology and ability to coach teams in its use.
  • Strong analytical, problem‑solving, facilitation, and communication skills, with the ability to manage multiple concurrent investigations effectively.
  • Ability to work collaboratively with cross‑functional technical and business teams in a fast‑paced enterprise IT environment.
  • U.S. citizenship and an active or obtainable SECRET clearance; less than 10% travel required.

Nice To Haves

  • Hands-on RCA experience in complex enterprise or federal environments.
  • Formal training in RCA or structured problem-solving techniques.
  • Experience using observability platforms, log analytics, and monitoring tools to drive data‑driven incident reconstruction and analysis.
  • Familiarity with reliability engineering concepts (such as SLOs, error budgets, and resiliency patterns) and how they inform RCA priorities and recommendations.

Responsibilities

  • Apply common RCA methodologies (such as 5 Whys, fishbone diagrams, fault tree analysis, and component failure impact analysis) and select appropriate techniques based on incident complexity and impact.
  • Gather and analyze monitoring data, logs, traces, configuration records, service topology maps, and incident timelines to distinguish contributing factors from true root causes.
  • Facilitate cross-functional RCA sessions with operations, engineering, cybersecurity, and business teams to drive focused discussion, manage differing viewpoints, and converge on agreed causes and remediation actions.
  • Translate RCA findings into corrective and preventive actions aligned with Problem Management workflows.
  • Define and track RCA metrics, such as recurrence rate and RCA cycle time, and use data-driven insights to improve analysis quality, timeliness, and effectiveness.
  • Support integration of RCA activities into ITIL-aligned Problem Management and continual service improvement practices.
  • Produce high-quality, audience‑appropriate RCA reports that describe what happened, why it happened, and how to prevent recurrence.
  • Identify systemic reliability risks and patterns across incidents.

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Number of Employees: 501-1,000
