About The Position

The Incident Response Engineer, Senior provides senior‑level technical leadership for resolving complex IT incidents that affect mission‑critical services in a federal enterprise environment. The role leads deep end-to-end investigations, using advanced observability, telemetry analysis, and cross-layer dependency mapping to isolate root causes and validate durable fixes. This position partners closely with incident managers, senior coordinators, engineering teams, and problem/change management teams to coordinate major events, shape incident response strategy, and elevate diagnostic practices across the operations organization. The senior engineer also drives continuous improvement by refining runbooks, tuning detection and alerting, and mentoring other responders to improve resilience and reduce time to restore.

Requirements

  • Bachelor’s degree in Information Technology, Computer Science, Business Administration, or related field, or equivalent relevant work experience.
  • Minimum of 8 years of experience in incident management, IT operations, reliability engineering, or related IT roles, including frequent responsibility for leading complex, multi‑system incident resolution.
  • Mastery of ITIL‑aligned incident management principles and best practices, with demonstrated experience coordinating major incidents in a large enterprise or federal IT environment.
  • Advanced proficiency with incident management tools and modern monitoring/observability platforms used for log analysis, performance monitoring, and alerting.
  • Proven ability to manage multiple complex incidents concurrently, synthesize technical information quickly, and communicate clearly and confidently with both technical teams and leadership.
  • Active or obtainable SECRET clearance and U.S. citizenship, with the ability to satisfy all applicable federal suitability and security requirements.

Nice To Haves

  • Background leading incident response in large‑scale, cloud‑centric, or hybrid environments, including ownership of cross‑team technical coordination and complex investigations.
  • Advanced incident response, cybersecurity, or IT service management certifications (such as higher‑level ITIL, incident‑response‑oriented, or security certifications).
  • Experience embedding incident insights into site reliability engineering practices, including error budgeting, reliability metrics, and capacity planning.
  • Demonstrated success building and refining automation for common remediation actions and verification checks.

Responsibilities

  • Technical Lead (under Major Incident Management direction): Lead complex investigations from scoping through closure; drive hypothesis-based troubleshooting; validate permanent fixes across distributed systems.
  • Observability & Diagnostics: Use modern monitoring, SIEM, and observability platforms to correlate metrics, traces, and logs; distinguish symptoms from root causes; map impacts across infrastructure, application, network, and identity layers.
  • Runbooks & Automation: Design/refine technical runbooks; implement scripts/orchestration to standardize responses and reduce manual effort; codify remediation/verification checks.
  • SRE & Architecture Integration: Translate incident insights into capacity planning, reliability metrics, and service design changes; partner with platform/reliability engineering teams.
  • Technical PIRs: Produce high-quality technical post-incident reviews (PIRs) for both engineering and executive audiences.
  • Cyber IR Interface: Coordinate with SOC/cyber responders when security indicators emerge; align IT operations and cyber incident response workflows without compromising restoration speed or safety.
  • Technical Mentoring: Coach incident responders and operations staff, raising the bar on diagnostic techniques, tool usage, documentation discipline, and adherence to incident management practices.