Problem Manager

Skyline Technology Solutions, LLC
Glen Burnie, MD
Posted 28 days ago

About The Position

The IT Operations Problem Manager is a strategic technical leader responsible for building and leading a high-performing problem management function across all Skyline divisions. This role serves as the cornerstone of our operational excellence strategy, combining deep technical expertise with ITIL problem management principles to systematically eliminate recurring incidents and drive continuous improvement across our customers’ infrastructure and services.

This position requires a unique blend of technical depth, analytical rigor, and leadership capability. The ideal candidate will architect and implement a problem management process, collaborate with and mentor a team of problem analysts, and work hands-on to investigate complex systemic issues across Skyline's technology stack.

You can expect to spend your time accomplishing the following:

  • 40% of the time on Objective 1: Problem Management Leadership & Process Ownership
  • 30% of the time on Objective 2: Technical Analysis & Engineering
  • 15% of the time on Objective 3: Collaboration & Stakeholder Management
  • 15% of the time on Objective 4: Continuous Improvement & Innovation

Requirements

  • 5+ years of hands-on experience in a problem management or leadership role, with demonstrated success improving reliability
  • 3+ years with DevOps practices, CI/CD pipelines, infrastructure as code (Terraform, Ansible), and configuration management
  • 3+ years using tools like Splunk, LogicMonitor, Prometheus, Grafana, or similar platforms for diagnostics and problem identification
  • 3+ years working with ServiceNow or similar ITSM platforms for problem, incident, and change management
  • 3+ years of deep experience with Linux system architecture, kernel operations, performance tuning, and troubleshooting in enterprise production environments
  • Ability to analyze complex, ambiguous situations and extract meaningful patterns and insights
  • Ability to challenge assumptions, ask probing questions, and distinguish correlation from causation
  • Capability to understand and troubleshoot issues spanning multiple interconnected systems and technology domains
  • Strong attention to detail in documenting investigations, findings, and solutions for future reference

Nice To Haves

  • ITIL 4 Managing Professional (MP) or ITIL 4 Strategic Leader (SL) certification (Preferred)
  • Additional Preferred Certifications:
  • Linux certifications (RHCE, LFCS, LFCE)
  • Network certifications (CCNA, CCNP, CWNA, or equivalent)
  • Cloud platform certifications (AWS Certified Solutions Architect, Azure Administrator, GCP Professional)
  • DevOps or SRE certifications (Certified Kubernetes Administrator, DevOps Institute)
  • Six Sigma or other continuous improvement methodologies
  • Experience in designing and managing complex infrastructure environments, including servers, storage, virtualization, and cloud platforms (AWS, Azure, GCP)
  • Experience in enterprise networking and connectivity solutions, encompassing wired/wireless technologies, WAN architectures, and advanced protocols (MPLS, BGP, SD-WAN)
  • Experience in security and monitoring, including hands-on work with network security technologies, physical security systems, and performance diagnostics
  • Adept at automation and scripting (Python, Bash, PowerShell) to streamline operations and support modern containerized and microservices-based architectures

Responsibilities

  • Architect and implement end-to-end problem management processes aligned with ITIL 4 best practices and integrated with existing ITSM workflows
  • Lead problem management from identification and logging through root cause analysis, resolution implementation, and closure
  • Personally lead investigations of major incidents, chronic issues, and high-impact problems requiring deep technical analysis
  • Lead blameless After Incident Reviews that extract maximum learning and drive actionable improvements
  • Apply structured methodologies (5 Whys, Fishbone, Kepner-Tregoe, Fault Tree Analysis) to identify true root causes versus symptoms
  • Own creation of internal and customer-facing After Incident Summary reports within the timeframes set by customer contractual Service Level Agreements (SLAs)
  • Establish and maintain the known error database with comprehensive documentation of problems, workarounds, and permanent solutions
  • Work with teams to analyze incident trends, monitoring alerts, and system telemetry to identify emerging problems before they cause a major impact
  • Ensure problems are resolved with permanent fixes, not workarounds; verify effectiveness and prevent recurrence
  • Establish problem prioritization criteria, SLAs, escalation paths, and review boards
  • Understand and map complex interdependencies across Skyline's infrastructure, applications, data flows, and integration points spanning all divisions
  • Analyze system architectures, code, configurations, logs, and performance metrics
  • Leverage Linux knowledge to task teams with investigating kernel issues, performance bottlenecks, resource contention, and system-level failures
  • Lead troubleshooting reviews of networking issues across enterprise wired/wireless networks and service provider connections; analyze routing, switching, firewall, and load balancing configurations
  • Identify single points of failure, cascading failure risks, and resilience gaps
  • Decompose systemic issues into discrete technical tasks and effectively assign work to specialized technical resources
  • Evaluate proposed fixes for completeness, sustainability, and potential unintended consequences
  • Participate in the Change Management process and assess how changes, upgrades, and architectural decisions affect system stability and problem recurrence
  • Develop comprehensive dashboards and reports showing problem trends, team performance, business impact, and ROI
  • Provide regular updates to leadership on problem management effectiveness, achievements, and strategic recommendations
  • Continuously improve problem management processes, tools, and procedures based on lessons learned and industry best practices
  • Mentor incident managers and technical teams on effective problem identification, escalation, and initial triage
  • Develop training materials, playbooks, and workshops to elevate organizational problem-solving capabilities

Benefits

  • Medical Insurance
  • Vision Insurance
  • Dental Insurance
  • FSA Plan
  • Paid Time Off
  • 401K Retirement Savings Plan
  • Training & Tuition Assistance
  • Disability & Life Insurance

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Education Level: Not listed
  • Number of Employees: 251-500
