Director, Middleware Reliability Engineering

American Electric Power • Columbus, OH

Onsite

About The Position

The Director, Middleware Reliability Engineering, is a critical technology leadership role accountable for ensuring the reliability, availability, performance, and operational excellence of the enterprise middleware platform. This leader will establish and scale middleware reliability engineering (MRE) practices across mission-critical integration platforms, embedding reliability into the core of how APIs, messaging, event streaming, and orchestration services are designed, deployed, and operated. Operating at the intersection of platform engineering, site reliability engineering (SRE), and DevOps, this role drives automation, observability, and continuous improvement to ensure the middleware ecosystem delivers predictable performance, high availability, and rapid recovery at enterprise scale. The Director is responsible for stabilizing today’s operations while modernizing reliability practices to support future growth and transformation.

Requirements

A bachelor’s degree in computer science, engineering, business, or a related technical field is required. An equivalent combination of education and related experience may be considered.
A minimum of 10 years of relevant work experience, including 6 years in leadership, is required.

Nice To Haves

Experience leading SRE or reliability teams.
Experience with AIOps and automation frameworks.
Familiarity with platforms required for regulated and unregulated markets.
Strong understanding of power systems, grid dynamics, and energy markets.
Knowledge of regulatory frameworks including PUCO, FERC, and NERC.
Passion for sustainability and the clean energy transition.

Responsibilities

Define and execute a comprehensive middleware reliability strategy aligned to enterprise availability, performance, resilience, and risk objectives.
Establish and govern SLAs, SLOs, and error budgets across middleware platforms and shared integration services.
Drive a culture of operational excellence, accountability, and continuous improvement across engineering and operations teams.
Lead incident management, problem management, and root cause analysis disciplines to reduce recurrence and improve overall system stability.
Ensure enterprise middleware platforms meet standards for high availability, scalability, fault tolerance, and recoverability.
Design and implement resilience patterns including failover, redundancy, circuit breaking, load balancing, and disaster recovery.
Lead capacity planning, performance engineering, and stress testing.
Proactively identify, assess, and remediate reliability risks.
Establish enterprise observability standards including logging, metrics, tracing, and alerting.
Implement real-time monitoring and automated alerting.
Drive automation of operational workflows including incident response and remediation.
Leverage AIOps and advanced analytics for predictive insights.
Partner with platform engineering and domain DevOps teams.
Ensure CI/CD pipelines include reliability and performance validation.
Enable shift-left reliability practices.
Support a Platform + Domain DevOps operating model.
Define and track reliability metrics including uptime, latency, error rates, and MTTR.
Establish governance for release readiness and change management.
Lead post-incident reviews with measurable outcomes.
Benchmark reliability against industry standards.
Build and lead a high-performing middleware reliability engineering organization.
Develop engineering talent and future leaders.
Foster a culture of ownership and engineering rigor.
Align teams to deliver both keep-the-lights-on (KTLO) excellence and modernization.