Director, Data Center Reliability Engineering

Oracle • Nashville, TN

About The Position

This role leads reliability engineering and analytics teams across multiple sites. It focuses on standardizing and enforcing methodologies such as Failure Mode and Effects Analysis (FMEA) and Root Cause Analysis (RCA), and on overseeing the deployment of monitoring, analytics, and automation tools. The position involves defining, tracking, and reporting reliability KPIs to executive leadership, ensuring corrective actions are implemented, and developing engineers in data-driven problem solving. The role offers global impact at scale within the technically rigorous, operationally excellent environment of Oracle Cloud Infrastructure, with opportunities for long-term career development.

Requirements

One of the following combinations of education and experience in IT infrastructure support, server administration, or data center operations, design, and layout:
12 years of experience; OR
Bachelor's Degree in Computer Science, Engineering, Information Systems, Information Technology, or related field AND 8 years of experience; OR
Master's Degree in Computer Science, Engineering, Information Systems, Information Technology, or related field AND 6 years of experience; OR
Doctorate in Computer Science, Engineering, Information Systems, Information Technology, or related field AND 4 years of experience.
Demonstrated ability to provide technical leadership and mentoring on complex projects.
Demonstrated knowledge of and adherence to regulatory, legal, and organizational compliance requirements.
Demonstrated ability to plan and carry out quality assurance activities for high-standard deliverables.
Demonstrated ability to select, onboard, and manage vendor relationships to ensure optimal performance.
Demonstrated experience managing data center facilities, including systems, operations, and resource allocation.
Senior experience in reliability engineering, maintenance engineering, or uptime-critical environments.
Strong background in analytics, RCA rigor, and reliability frameworks.
Strong technical leadership and stakeholder influence.
Comfortable translating analysis into executive-level decisions.

Nice To Haves

One of the following combinations of education and experience in IT infrastructure support, server administration, or data center operations, design, and layout:
13 years of experience; OR
Bachelor's Degree in Computer Science, Engineering, Information Systems, Information Technology, or related field AND 9 years of experience; OR
Master's Degree in Computer Science, Engineering, Information Systems, Information Technology, or related field AND 7 years of experience; OR
Doctorate in Computer Science, Engineering, Information Systems, Information Technology, or related field AND 5 years of experience.
5 years of experience in a leadership role with direct reports.
3 years of experience working with operating budgets and/or project financials.
Data Center or Cloud Industry Certifications.

Responsibilities

Lead reliability engineering and analytics teams across multiple sites.
Standardize and enforce FMEA, RCA, and continuous improvement methodologies.
Oversee deployment of monitoring, analytics, and automation tools supporting reliability programs.
Define, track, and report reliability KPIs to executive and global operations leadership.
Ensure corrective actions are implemented, verified, and sustained.
Develop engineers and analysts in disciplined, data-driven problem solving.
Data Center Site Portfolio Management: Act as Data Center country leader with responsibility for one or more sites and teams in a region.
Performance Monitoring and Analysis: Set strategic direction for data center operations performance monitoring, network performance evaluation, and analysis of physical, power, and cooling capacity, collaborating with executive leadership.
Define the strategic direction for continuous improvement to achieve KPIs and objectives.
Issue Management and Automation: Oversee support for escalated complex technical issues, and define and enforce strategies for issue triage using automation, scheduling, and monitoring tools.
Identify, document, and standardize issues, processes, and solutions, ensuring a comprehensive data center knowledge base.
Oversee the implementation of incident or crisis management protocols aligned with business continuity plans.
Establish best practices for Root Cause Analysis (RCA) and update documentation for process improvements.
Data Center Expansion Support: Set strategic direction and oversee new region builds and expansion activities.
Act as the primary liaison with senior project teams and data center engineering leadership for expansion projects.
Collaborate with project teams to ensure world-class standards in expansion projects and site builds.
Installation and Maintenance: Direct installations, repairs, inventory management, and logistics across data centers.
Establish standards and best practices for component replacements and upgrades.
Advise on and manage large-scale purchases or upgrades for data centers.
Ensure implementation of proactive maintenance and lifecycle management strategies for data center facilities.
Planning & Execution: Oversee and guide multiple teams on managing complex projects, monitoring timelines, deliverables, and budgets.
Collaboration & Partnership: Lead cross-functional collaborative efforts, and build and maintain partnerships with business leaders, stakeholders, and customers.
Problem Solving: Share problem-solving strategies, provide oversight on complex operational and/or technical issues, and coach teams on analyzing complex data to identify solutions.
Continuous Learning: Pursue strategic learning opportunities, create opportunities for team members to build expertise, and identify skill gap trends.
Continuous Improvement: Empower teams to own the development and implementation of process improvements, and prioritize the roadmap of improvement initiatives.
Performance and Development: Drive performance through tailored feedback and coaching, ensure consistency in talent development procedures, and align individual development goals with organizational strategic initiatives.