Incident and Problem Manager

NorthMark Strategies • Dallas, TX

About The Position

NorthMark Compute & Cloud (NMC²) operates at the bleeding edge of technology, backed by dedicated leadership and investment and guided by a clear mission. Its goal is to scale and enhance the high-performance computing (HPC) and cloud infrastructure that supports its clients' research, production, and delivery, enabling breakthroughs that shape the industries of tomorrow. Its engineers build critical infrastructure to eliminate friction in scientific research, simulations, analysis, and decision-making, accelerating discovery and driving faster innovation.

The Position

The Incident & Problem Manager is accountable for establishing and operating the Incident Management and Problem Management practices within NMC², ensuring that service disruptions are resolved quickly, root causes are identified and eliminated, and lessons learned drive continuous improvement across the ITSM ecosystem. This combined role owns the full lifecycle of reactive and proactive service restoration, from initial detection and triage through resolution, root cause analysis, and known error documentation, ensuring minimal business impact and sustained service reliability.

The ITSM team is responsible for ensuring the reliability and stability of services across NMC²’s infrastructure and operations. Within that team, the Incident & Problem Manager owns the end-to-end lifecycle of service disruptions, ensuring rapid restoration, effective escalation, and long-term resolution of underlying issues. Working alongside Service Desk, Engineering, Data Center Operations, and vendors, you will lead major incident response, drive root cause analysis, and implement continuous improvement across the ITSM ecosystem. This role plays a critical part in maintaining service availability and improving operational maturity at scale.

Requirements

Bachelor’s Degree or equivalent experience
5+ years of experience in IT Service Management, with ownership of Incident and/or Problem Management
Proven experience managing major incidents in high-availability or mission-critical environments
Hands-on experience with Jira Service Management or similar ITSM tooling
Strong understanding of incident lifecycle management, escalation, and service restoration
Experience conducting root cause analysis and driving long-term remediation
Strong analytical and problem-solving skills, with the ability to identify trends in operational data
Excellent communication skills with the ability to coordinate across technical and non-technical teams
Must be legally authorized to work in the United States without the need for employer sponsorship, now or at any time in the future

Nice To Haves

ITIL certification or equivalent experience

Responsibilities

Own and manage the end-to-end major incident process, acting as the primary escalation point for high-severity incidents
Lead incident response efforts, coordinating cross-functional teams to restore service as quickly as possible
Define and improve incident and problem management processes, ensuring consistent execution and high-quality data in Jira Service Management
Drive root cause analysis and problem management activities, ensuring recurring issues are identified and permanently resolved
Maintain and leverage a Known Error Database to document workarounds and solutions
Analyze incident trends and performance metrics to identify systemic issues and improvement opportunities
Partner with engineering, service owners, and change management to implement fixes and prevent recurrence
Produce regular reporting on KPIs such as MTTR, SLA performance, and incident trends

Benefits

Company-Paid Lunch Stipend: Lunch is provided via GrubHub
Company-Paid Benefits: 100% Employer-Paid Medical in our High Deductible Health Plan, Dental and Vision benefits for employees and their families, 16 weeks of Paid Parental Leave, Employee Assistance Program, Life insurance, Short-Term Disability and Long-Term Disability
401(k): Company will match 100% of your contributions up to 6%
Optional Employee-Paid Benefits: Medical insurance in our PPO plan and a variety of other benefits such as Health Savings Accounts (with Company Contribution!), Flexible Spending Accounts, Supplemental Life Insurance, Wellhub and more.
Time Off: 25 days of Paid Time Off plus 12 company holidays

What This Job Offers

Job Type

Full-time

Career Level

Mid Level
