Senior Incident Optimization Specialist - Data & Middleware VP

Citi•Irving, TX

23d•Onsite

About The Position

The Senior Incident Operations & Optimization Specialist for Data & Middleware is a specialized technical leadership role requiring deep expertise in database technologies, messaging platforms, and application middleware. This position is essential to the Incident Reduction Program, as database and middleware systems generate a significant volume of operational incidents while serving as critical infrastructure for enterprise applications. You will be responsible for building automated incident remediation workflows and achieving measurable incident reduction through intelligent correlation, threshold optimization, and automation. You will do so while ensuring that the health and performance of business-critical data and middleware platforms remain visible and protected. This role offers the opportunity to modernize observability and event management for the data layer and integration tier of enterprise architecture.

Requirements

Experience: A minimum of 8 years of hands-on experience in database administration, middleware engineering, or enterprise data platform operations.
Event Management & Incident Reduction: Proven experience in event management, alert tuning, and incident reduction for data and middleware services, with measurable results. Direct, hands-on experience with modern AIOps and event management platforms is required.
Technical Expertise: Deep knowledge of both relational (e.g., Oracle, SQL Server) and NoSQL (e.g., MongoDB) database technologies, including clustering, replication, and performance tuning. Expertise in middleware platforms, including messaging technologies (e.g., MQ, Kafka) and application servers (e.g., WebSphere, Tomcat).
Automation & Scripting: Hands-on experience developing robust automation solutions using relevant scripting languages (e.g., Python, Shell) and modern automation frameworks.
Data Analysis: Proficiency in log analysis, pattern recognition, and using query languages for data analysis on log aggregation platforms.
Problem-Solving & Analytical Skills: Excellent analytical abilities with a systematic approach to troubleshooting complex data platform architectures and correlating infrastructure issues with application impact.
Communication & Leadership: Exceptional communication skills with the ability to collaborate effectively with DBAs, middleware engineers, and application teams, and to present technical concepts to diverse audiences.

Nice To Haves

An advanced degree (Master's) in a relevant technical field.
Relevant industry certifications (e.g., Database, Middleware, Cloud, Automation, ITIL).
Experience with Database as a Service (DBaaS) platforms and other database technologies.
Knowledge of data governance, security, and compliance requirements in a regulated environment.
Background in large-scale financial services environments.
Experience with modern observability platforms, distributed tracing, and infrastructure-as-code (IaC) principles.

Responsibilities

Incident & Alert Analysis: Analyze and optimize monitoring across all database and middleware platforms to address high-volume, low-value alerts, identify patterns in incident generation, and determine root causes.
Intelligent Event Management: Develop and implement domain-specific correlation, de-duplication, and suppression rules on AIOps and event management platforms. Create logic that understands database cluster relationships, messaging dependencies, and application-to-database connections.
Automation & Self-Healing: Architect and develop automation playbooks for incident data enrichment and automated remediation of common database and middleware issues, such as connection pool resets or service restarts.
Observability Enhancement: Identify monitoring gaps across the data and middleware landscape, proposing enhancements to ensure comprehensive health monitoring and address blind spots in transactional flows.
Cross-Functional Collaboration: Partner closely with Database Administration (DBA), middleware engineering, and application teams to validate correlation logic, build consensus on threshold changes, and provide expert guidance on event management best practices.
Quality Assurance: Continuously validate the effectiveness of implemented rules and automation, ensuring critical health indicators remain highly visible. Lead post-implementation reviews and drive iterative improvements.

Benefits

Medical, dental & vision coverage
401(k)
Life, accident, and disability insurance
Wellness programs
Paid time off packages, including planned time off (vacation), unplanned time off (sick leave), and paid holidays