Director, Reliability Engineering

ServiceNow • Washington, DC

1d • $221,200 - $387,100 • Hybrid

About The Position

ServiceNow is seeking a Director of Reliability Engineering to lead a strategic, enterprise-scale evolution of reliability engineering. Embedded within the Site Reliability & Database Engineering organization, this leader will drive technical excellence and operational performance to ensure ServiceNow’s cloud platform maintains industry-leading reliability and performance. This role is accountable for advancing SRE maturity, accelerating adoption of DevOps and automation-first practices, and preparing global teams for the platforms, operating models, and cultural shifts required to sustain ServiceNow’s cloud reliability at scale.

Requirements

Proven track record leading operational excellence and reliability initiatives in a technical, engineering, or cloud operations environment.
Strong cross-functional leadership experience, with the ability to influence without authority across Engineering, Product, People, Finance, and executive stakeholders.
Deep familiarity with structured change management methodologies and the ability to adapt those frameworks to fast-moving engineering cultures.
Demonstrated ability to apply AI-enabled automation, workflow redesign, and operational insights to improve decision-making, reduce toil, and scale reliability practices.
Hands-on background in SRE and DevOps environments, sufficient to credibly lead change programs with engineering teams and assess technical impact.
Strong understanding of cloud operations models (SaaS, PaaS, IaaS) and the ServiceNow platform or comparable enterprise platforms.
Comprehensive knowledge of ITIL principles (v3 or v4) and how process transformation intersects with service management maturity.
Experience operating across global, follow-the-sun models, with cultural intelligence, geographic awareness, and experience managing distributed teams.
Exceptional operational communication and storytelling skills, with the ability to translate complex technical initiatives into compelling business narratives and actionable operating plans.
Strong analytical and problem-solving capabilities; you use data and metrics to drive change and demonstrate outcomes.
Senior leadership experience in SRE, DevOps, Cloud Engineering, or Reliability Engineering, ideally within regulated or enterprise-scale environments.
Experience in leveraging or critically thinking about how to integrate AI into work processes, decision-making, or problem-solving. This may include using AI-powered tools, automating workflows, analyzing AI-driven insights, or exploring AI's potential impact on the function or industry.
12+ years in software engineering, with at least 5–7 years in a senior leadership role overseeing large-scale distributed systems using public cloud technologies from Azure, Amazon Web Services, or Google Cloud Platform.
Experience delivering SaaS products with high performance, scalability, security, and availability.
Experience developing agentic AI frameworks to eliminate toil by automating the detection and remediation of transient service disruptions.
Proven experience with cloud-native architectures, Infrastructure as Code (IaC), Kubernetes, and microservices at scale.
Proven commitment to eliminating recurring problems through durable engineering solutions, automation, and systemic reliability improvements.
Demonstrated success building, leading, and developing globally distributed engineering teams in fast-paced, high-growth environments.
Empathetic, collaborative, and systems-oriented leader who enables teams, builds trust, and guides organizations from hero-based operations toward scalable reliability practices.
Executive-level communication skills with the ability to translate complex technical topics into clear business implications, decisions, and priorities.
Bachelor’s degree in computer science or a related field.

Responsibilities

Define and execute structured change strategies for major SRE initiatives, platform migrations, and operating model shifts, including stakeholder engagement, impact assessments, communication plans, and adoption risk management.
Own all incidents and escalations pertaining to the SRE teams, along with the associated processes and workflows, with particular focus on maintaining the performance and availability of the supported environments. Participate in the continued development and execution of SRE management processes, including Incident, Problem, Configuration, and Change management.
Serve as a senior change leader across Engineering, Product, People Operations, Finance, and executive stakeholders to align priorities, remove technical blockers, and sustain reliable delivery.
Drive organizational readiness ahead of major technology and process deployments by partnering with enablement, training, and communications teams to accelerate adoption.
Establish reliability metrics, SLO targets, performance baselines, and governance mechanisms to track progress, surface operational risks, and communicate outcomes to senior leadership.
Champion a culture of continuous improvement, learning, and psychological safety by embedding feedback loops and retrospectives into standard operating practices.
Facilitate executive steering forums and cross-functional working groups, producing clear communications that translate reliability initiatives into business outcomes.
Build trust and alignment across SRE teams by bridging cultural, geographic, and operational differences with clarity, empathy, and structure.
Partner with People Operations on workforce planning, change impact assessments, and role evolution strategies as the organization modernizes its delivery model.
Analyze current procedures and processes and drive continuous improvement efforts to ensure the SRE teams provide quality service across all functional areas.
Set up and continuously monitor KPIs and metrics pertaining to individual and team performance.
Lead and develop high-performing teams, with accountability for team health, capability growth, career development, and performance outcomes.
Establish and evolve career frameworks, learning pathways, and development programs that reflect both technical mastery and change leadership competencies.
Lead talent acquisition and onboarding strategies that build resilient, change-ready, and growth-oriented SRE teams.
Drive a DevOps and automation-first mindset across teams, reducing reliance on manual and repetitive processes through structured transformation initiatives.
Ensure rigorous change governance, overseeing how changes are scheduled, communicated, executed, and measured, with a strong focus on risk reduction in regulated environments.
Lead process redesign efforts that improve consistency, predictability, and quality of service delivery across all SRDE functional areas.
Establish documentation and training standards that equip internal partners and downstream teams to operate effectively through periods of change.