Staff Software Engineer, Incident Management (Hybrid - Acton, MA)

Insulet Corporation
San Diego, MA
Hybrid

About The Position

The Staff Software Engineer – Incident Management will play a critical role in strengthening Insulet’s ability to respond to and recover from major incidents impacting our platform and services. This role focuses on engineering solutions that improve incident detection, response, and resolution, while partnering closely with Incident Managers, SREs, and cross-functional teams. The ideal candidate combines technical expertise with a deep understanding of incident lifecycle management and operational resilience.

Requirements

  • Strong understanding of incident management principles and frameworks (e.g., ITIL).
  • Hands-on experience with incident response in complex, distributed systems.
  • Hands-on experience conducting post-incident reviews (blameless post-mortems).
  • Strong understanding of cloud computing platforms (e.g., AWS, Azure, GCP) and container orchestration technologies (e.g., Kubernetes).
  • Hands-on experience with monitoring and alerting tools (e.g., Datadog, PagerDuty, Prometheus, Grafana).
  • Strong communication and leadership skills, with the ability to collaborate effectively with cross-functional teams.
  • Ability to work under pressure and make decisions during high-impact incidents.
  • Excellent troubleshooting and problem-solving skills.
  • Bachelor’s degree required (preferred field of study: Computer Science, Engineering, or related field).
  • 7+ years of experience in software engineering, operations, or reliability roles.
  • At least 3 years focused on incident management or operational resilience.
  • Proven track record of improving incident response processes and reducing MTTR.

Nice To Haves

  • Experience with cloud platforms (AWS, Azure, or GCP).
  • Understanding of compliance and security requirements in regulated environments.
  • Ability to mentor others on incident response best practices.
  • Proficiency in scripting or automation (Python, Bash, or similar) for operational tasks.

Responsibilities

  • Driving the incident management process and coordinating with all teams involved in resolving the incident, including SRE, R&D, IT, vendors, and stakeholders.
  • Responding to incidents and initiating the incident management process.
  • Prioritizing incidents according to their urgency and business impact.
  • Coordinating response efforts and collaborating with the incident response team to ensure that all protocols are diligently followed.
  • Communicating with internal stakeholders on major incidents and impacts.
  • Producing documents that outline incident timelines and actions taken during the incident.
  • Coordinating post-incident RCAs with responders and SMEs and communicating to stakeholders.
  • Designing and implementing automation for incident detection, triage, and resolution.
  • Developing and maintaining runbooks, playbooks, and tooling to streamline incident response.
  • Collaborating with Incident Managers to improve processes and reduce Mean Time to Recovery (MTTR).
  • Participating in major incident response efforts, providing technical leadership during high-severity events.
  • Leading post-incident reviews and implementing preventive measures to avoid recurrence.
  • Contributing to continuous improvement of incident management frameworks and best practices.
  • Partnering with SRE and development teams to embed reliability and resilience into system design.

Benefits

  • Medical, dental, and vision insurance
  • 401(k) with company match
  • Paid time off (PTO)
  • Additional employee wellness programs

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Number of Employees: 1,001-5,000 employees
