Associate Director - IT Service Reliability & Operations

Eli Lilly and Company • Indianapolis, IN

About The Position

This role is accountable for the end-to-end operational reliability, demand management, and capacity sustainability of Tech@Lilly’s centralized service and reliability operations, ensuring that incidents, events, requests, systemic reliability risks, and operational demand are managed through a centralized, standardized, data‑driven, and increasingly automated operating model. The role serves as the operations and demand lead for a centralized Service Reliability and Operations capability, leading the progressive shift from human-executed operations to an increasingly automated, agent-assisted operating model. It is responsible for driving technology-enabled transformation across operations through the practical application of AI, automation, agentic solutions, demand forecasting, and capacity planning to ensure services scale predictably and reliably.

Requirements

Bachelor's degree in Business, Information Technology, STEM or related field
12+ years of IT experience, with significant time in production operations, reliability, service management, or SRE-adjacent environments
7+ years leading vendor or supplier‑supported operating models, including capacity planning, demand forecasting, and driving automation and innovation
5+ years in people leadership roles within complex, global environments
Experience leading teams responsible for service reliability engineering or automation
Demonstrated experience leading high‑severity incident response and operational risk mitigation
Hands-on experience with AI in production operations, reliability, service management, or SRE-adjacent environments
Qualified applicants must be authorized to work in the United States on a full-time basis. Lilly will not provide support for or sponsor work authorization or visas for this role, including but not limited to F-1 CPT, F-1 OPT, F-1 STEM OPT, J-1, H-1B, TN, O-1, E-3, H-1B1, or L-1
Strong executive communication skills with the ability to translate operational and AI signals into business‑relevant insights

Nice To Haves

Proven success implementing centralized, tiered operating models that incorporate demand management and capacity planning to scale globally, and that leverage automation and AI to improve consistency and resilience.
Deep understanding of Incident, Event, Change, and Problem Management, including how these practices are enhanced through automation, analytics, and AI-assisted workflows in a mature ITSM environment.
Demonstrated ability to use operational and demand data to drive capacity decisions, reliability improvements, and executive confidence.

Responsibilities

Lead a centralized Service Reliability & Operations function, providing standardized intake, triage, coordination, and governance across incidents, events, requests, and problems.
Deploy AI-assisted triage models that automatically classify, prioritize, and route incidents based on historical patterns, service risk profiles, and real-time signals.
Establish and govern an automated remediation capability for known failure patterns, with human-in-the-loop escalation for high-risk scenarios.
Own the operational runbook strategy, ensuring runbooks are not static documentation artifacts but active, machine-readable automation inputs that drive consistent, auditable response execution with or without human initiation.
Own service stability and operational readiness outcomes for a heterogeneous, global application estate.
Ensure Major Incident Management discipline, including command, escalation, and executive communications, is consistently executed for critical services.
Own operational demand management for centralized production support, ensuring requests, enhancements, onboarding, and change-driven demand are visible, prioritized, and aligned to reliability and capacity constraints.
Define and govern standardized demand intake, categorization, and prioritization models, balancing business urgency, service risk, and operational sustainability.
Leverage AI-driven demand pattern analysis to identify predictable, automatable demand, ensuring human capacity is protected for high-judgment activity.
Use demand trends to influence service design, onboarding decisions, and support models, preventing unmanaged growth in operational complexity.
Own capacity planning using demand, incident, and service risk data to proactively forecast workload and skill needs.
Translate demand signals into staffing strategies, automation investments, and capacity plans.
Ensure the operating model scales sustainably by balancing workloads.
Strengthen Problem Management as a reliability and demand-reduction lever, using incident trends and recurrence signals to drive systemic risk reduction rather than reactive firefighting.
Own and maintain a living automation roadmap that sequences opportunities by ROI, operational risk reduction, and technical feasibility, with the aim of reducing MTTR, operational toil, and demand on human resources.
Partner with engineering, platform, and SRE teams to establish feedback loops between automated remediation outcomes and the knowledge base, ensuring every automated action either confirms or improves the underlying runbook, creating a self-reinforcing reliability system.
Use incident, change, and risk data to prioritize staffing, automation, and reliability improvement investments across the portfolio.
Define and standardize operational KPIs and health indicators, including demand volume, capacity utilization, MTTR, and automation effectiveness.
Ensure ITSM and observability tooling supports consolidated intake, standardized workflows, measurable outcomes, real-time visibility, predictive insights, and actionable reporting.
Partner on AI‑enabled and automated capabilities that improve productivity and reliability across operational teams, including predictive insights, automated remediation, and agent-driven coordination across teams.
Lead and develop operations, reliability, and service management leaders, setting clear expectations for technical literacy, automation-first thinking, demand accountability, and outcome ownership.
Serve as a trusted partner to business service owners, risk, security, and technology leaders, translating operational data and AI-generated insights into actionable, business-relevant decisions.
Drive organizational change as services, suppliers, and operating models evolve, ensuring stability, transparency, and capacity are protected during transitions.

Benefits

company bonus (depending, in part, on company and individual performance)
company-sponsored 401(k)
pension
vacation benefits
eligibility for medical, dental, vision and prescription drug benefits
flexible benefits (e.g., healthcare and/or dependent day care flexible spending accounts)
life insurance and death benefits
certain time off and leave of absence benefits
well-being benefits (e.g., employee assistance program, fitness benefits, and employee clubs and activities)
