Incident Manager

Crusoe • San Francisco, CA


About The Position

This Incident Manager role is critical to upholding service reliability and customer trust, and it directly affects company success by minimizing downtime and resolving critical issues. You will spearhead the management of high-visibility incidents and customer escalations, ensuring rapid and effective responses to complex technical challenges. Beyond immediate resolution, we want to sharpen our incident management practices so that customers have a superior experience during "storms" and robust preventative measures follow afterward. You will leverage data analytics to drive greater resiliency and reliability, ensuring that every incident translates into a stronger product and process.

Requirements

Core Tech Stack: Strong hands-on experience with Linux, virtualization, and Kubernetes, and with handling customer incidents.
Certifications: We are looking for candidates who actively update their skill sets. NVIDIA, Linux, and Kubernetes certifications are strongly preferred to demonstrate a deep understanding of the products our CSEs and CSMs support.
Networking & Infrastructure: Solid understanding of the TCP/IP stack and Infrastructure-as-Code (IaC) practices.
Experience: 4-5 years of customer-facing experience and 3-5+ years of experience in a team leadership role, acting as a liaison with external and internal customers.
Crisis Handling: A proven track record in crisis management, capable of navigating high-pressure situations with a focus on customer experience.
Problem Solving: A proven problem-solving mindset with the ability to diagnose and resolve complex technical issues.
Communication: Excellent communication skills, both written and verbal.

Nice To Haves

Bonus Skills: Proficiency in one or more programming languages.

Responsibilities

Handle the "Storm": Lead incident responses for high-visibility issues, ensuring minimal disruption to customer operations. You will act as the calm anchor during crises, managing communication and strategy to maintain customer trust during outages or critical failures.
Analytics & Reliability: Utilize data analytics to identify trends in incidents, translating these insights into actionable strategies for greater system resiliency and reliability.
Preventative Strategy: Develop robust incident response strategies and designs. Focus on the "preventative piece" by conducting deep post-incident reviews to ensure root causes are addressed and recurrences are eliminated.
Troubleshoot and Resolve: Diagnose and resolve complex technical issues related to InfiniBand, containerization, and distributed training.
Implement and Optimize: Guide and assist customers in implementing and optimizing their HPC infrastructure to achieve maximum performance and efficiency.
Educate and Empower: Develop and deliver training materials, including internal training sessions, documentation, and knowledge base articles, to empower customers to effectively utilize our solutions.
Collaborate Internally: Work closely with internal engineering and product teams to provide valuable customer feedback. You will act as a key technical resource, helping our Customer Support Engineers (CSEs) and Customer Success Managers (CSMs) understand and resolve complex product issues.

Benefits

Industry-competitive pay
Restricted Stock Units in a fast-growing, well-funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSA accounts
Paid Parental Leave
Paid life insurance, short-term and long-term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement
Subscription to the Calm app
MetLife Legal
Company-paid Commuter FSA benefit of $200 per month