About The Position

The Senior Incident Optimization Specialist serves as a critical bridge between the Technology Incident Optimization Program and the core Compute, Virtualization, Cloud Services, and Storage technology domains. This role demands deep technical expertise combined with strategic thinking to drive tactical incident reduction while architecting the future state of intelligent event management and automation. You will build automated incident remediation workflows and achieve measurable incident reduction within your domain through event optimization, correlation, and automation, while maintaining and enhancing comprehensive observability. This position offers a unique opportunity to shape the future of enterprise event management.

Requirements

  • Education: Bachelor's degree in Computer Science, Information Technology, Computer Engineering, or a related technical field.
  • Experience: A minimum of 8 years of hands-on experience in IT operations, infrastructure engineering, or system architecture within large-scale enterprise environments.
  • Event Management & Incident Reduction: Demonstrated success leading event management and incident reduction initiatives with quantifiable results. Direct, hands-on experience with modern AIOps and event management platforms is required.
  • Technical Expertise: Deep understanding of enterprise infrastructure including virtualization architectures, container orchestration, microservices, and various storage architectures (block, file, object). Expertise with a broad range of domain-specific monitoring tools for compute, virtualization, storage, and cloud platforms.
  • Automation & Orchestration: Hands-on experience developing robust automation solutions using scripting languages and modern automation frameworks.
  • Data Analysis: Proficiency in log analysis, pattern recognition, and using query languages for data analysis on log aggregation platforms.
  • Problem-Solving & Analytical Skills: Excellent analytical abilities with a systematic approach to troubleshooting complex issues and a holistic view of technology systems.
  • Communication & Leadership: Exceptional communication skills with the ability to influence and collaborate effectively across diverse, cross-functional teams and present technical concepts to various audiences.

Nice To Haves

  • An advanced degree (Master's) in a relevant technical field.
  • Relevant industry certifications (e.g., Cloud, Virtualization, Automation, ITIL).
  • Experience with AIOps, machine learning for IT operations, and Site Reliability Engineering (SRE) practices.
  • Knowledge of ITSM platforms, CMDB management, and infrastructure-as-code (IaC) principles.
  • Familiarity with financial services regulatory requirements.

Responsibilities

  • Incident & Alert Analysis: Conduct comprehensive analysis of alert and incident patterns to identify top sources of operational noise, determine root causes, and develop data-driven strategies for reduction.
  • Intelligent Event Management: Design, implement, and optimize rules for event correlation, de-duplication, and suppression on AIOps and event management platforms. Develop domain-specific correlation logic leveraging configuration management data and infrastructure topology.
  • Automation & Self-Healing: Architect and develop automation playbooks for incident data enrichment and create self-healing capabilities for common and recurring infrastructure incident scenarios.
  • Observability Enhancement: Assess the current observability footprint across all infrastructure domains to identify gaps and propose enhancements that align with enterprise event management standards.
  • Cross-Functional Collaboration: Partner closely with infrastructure operations, engineering, and platform teams to understand incident drivers, validate correlation logic, and provide expert guidance on event management best practices.
  • Quality Assurance: Continuously validate the effectiveness of implemented rules and automation to ensure no business-impacting alerts are missed. Monitor and report on alert quality metrics and lead iterative improvements.

Benefits

  • In addition to salary, Citi’s offerings may also include, for eligible employees, discretionary and formulaic incentive and retention awards.
  • Citi offers competitive employee benefits, including: medical, dental & vision coverage; 401(k); life, accident, and disability insurance; and wellness programs.
  • Citi also offers paid time off packages, including planned time off (vacation), unplanned time off (sick leave), and paid holidays.
  • For additional information regarding Citi employee benefits, please visit citibenefits.com.
  • Available offerings may vary by jurisdiction, job level, and date of hire.