IT Support Manager

RaceTrac • Atlanta, GA

About The Position

The Support Manager is responsible for ensuring the availability, stability, and performance of the Loyalty platform. This role leads incident response, operational support, and continuous service improvement initiatives, working with the Engineering teams and the business to ensure systems remain secure, highly available, and capable of delivering exceptional service. The ideal candidate brings strong experience in observability platforms such as Dynatrace and Azure Application Insights, advanced KQL-based diagnostics, and the ability to troubleshoot complex integrations within distributed systems and cloud platforms.

What You'll Do

Serve as the primary point of contact for all major incidents related to the team’s technologies and platforms.
Lead and coordinate incident response activities, ensuring rapid resolution and effective communication during critical events.
Manage the offshore support team, service availability, and operational processes for the team while monitoring workloads to maintain appropriate resource levels.
Ensure effective implementation and governance of the Incident and Problem Management process, including reporting and post-incident reviews.
Identify, initiate, schedule, and conduct incident reviews and root cause analysis to drive long-term resolution and prevent recurrence.
Monitor incidents to ensure SLA adherence and operational performance targets are consistently met.
Ensure the proper closure and documentation of all resolved incidents after end-user confirmation.
Maintain a strong business-level understanding of critical applications supporting key operational areas.
Utilize Dynatrace and Azure Application Insights to monitor application performance, detect anomalies, and proactively identify system degradation.
Leverage Kusto Query Language (KQL) to analyze telemetry, logs, and metrics to diagnose production issues.
Troubleshoot complex integrations across distributed systems, including APIs, event-driven architectures, cloud services, and third-party platforms.
Proactively identify potential issues requiring remediation and develop action plans in collaboration with business and IT teams.
Produce daily, weekly, and monthly operational reports to demonstrate SLA compliance and system performance.
Establish strong relationships across RaceTrac to effectively coordinate resolution during critical incidents.
Maintain awareness of all operational and event-related activities and provide regular updates to senior management.
Drive continuous process improvement by reviewing incident management processes, operational workflows, tools, and technologies.

Requirements

Bachelor’s degree from an accredited college or university in Information Technology or related field, or equivalent work experience.
5+ years of experience in production support, incident management, or site reliability engineering environments.
Strong experience with observability and monitoring platforms, particularly Dynatrace and Azure Application Insights.
Experience managing support teams (FTE or third-party resources).
Advanced experience with Kusto Query Language (KQL) for log analysis and telemetry diagnostics.
Demonstrated ability to troubleshoot complex integrations in distributed cloud-based platforms.
Knowledge of IT infrastructure, networking, and cloud environments, with a strong preference for Microsoft Azure.
Experience with IT Service Management (ITSM) processes and tools; ServiceNow experience is a plus.
Strong analytical, problem-solving, and communication skills, especially during high-severity incidents.

Responsibilities

Serve as the point of contact for all major incidents related to the team’s technologies.
Manage all support staff, service availability and operational processes for the team.
Monitor workloads to ensure an appropriate number of resources is maintained.
Ensure the effective implementation of the Incident Management process and carry out the associated reporting procedures.
Identify, initiate, schedule and conduct incident reviews.
Monitor incidents to ensure SLA adherence.
Ensure the closure of all resolved, end-user-confirmed incident records.
Maintain business-level knowledge of the key applications for each business area.
Proactively identify and evaluate potential issues that may need remediation, and create a plan of action with the business and IT to remediate them.
Create daily, weekly, and monthly operations reports as required to confirm that all required SLAs are being met for internal and external clients.
Establish relationships across RaceTrac in order to fully support any critical incident as required.
Maintain awareness of all event-related activities and provide ongoing communication and feedback to senior management.
Establish continuous process improvement cycles in which process performance, activities, roles and responsibilities, policies, procedures, and supporting technology are reviewed and enhanced where applicable.