Director, Site Reliability Engineering

Oracle • Seattle, WA

1d • $121,500 - $306,400

About The Position

Provides leadership to one or more teams designing and architecting infrastructure and services, and provides input on best practices for reliability and functionality. Establishes direction to ensure accurate forecasting and that systems have adequate resources. Builds collaborative relationships with the software development team to create reliable, scalable infrastructures. Ensures alignment regarding data collection and contributes to standards for optimizing operations and infrastructure reliability. Defines approaches for incident response activities to ensure service reliability. Ensures that reports are in-depth. Plays a key role in developing standards for identifying and recommending automation. Anticipates and explains the impact of changes, mentoring other managers on what to communicate. Defines approaches for escalating incidents and refines methods for documentation. Encourages experimenting with new technology, executing improvements, building site reliability knowledge, and providing clear data.

Requirements

Leadership experience
Experience designing and architecting infrastructure and services
Knowledge of best practices for reliability and functionality
Forecasting and resource allocation for systems
Collaboration with software development teams
Understanding of data collection and operational standards
Experience defining incident response approaches
Experience developing standards for automation
Mentoring skills
Experience defining incident escalation methods
Experience refining documentation methods
Experience experimenting with new technology
Experience building site reliability knowledge
Ability to provide clear data

Responsibilities

Provides leadership to one or more teams designing and architecting infrastructure and services
Provides input on best practices for reliability and functionality
Establishes direction to ensure accurate forecasting and that systems have adequate resources
Builds collaborative relationships with the software development team to create reliable, scalable infrastructures
Ensures alignment regarding data collection and contributes to standards for optimizing operations and infrastructure reliability
Defines approaches for incident response activities to ensure service reliability
Ensures that reports are in-depth
Plays a key role in developing standards for identifying and recommending automation
Anticipates and explains the impact of changes, mentoring other managers on what to communicate
Defines approaches for escalating incidents and refines methods for documentation
Encourages experimenting with new technology, executing improvements, building site reliability knowledge, and providing clear data