Senior Site Reliability Engineer

Oracle • Pleasanton, CA

About The Position

We are looking for a Site Reliability Engineer 3 to support mission-critical cloud services and production operations. The role focuses on improving service reliability, reducing operational risk, automating repetitive tasks, and driving faster detection and resolution of issues. The engineer will work closely with development, infrastructure, security, and operations teams to monitor service health, troubleshoot production issues, participate in incident response, improve observability, and implement reliability best practices. The role also includes analyzing recurring failures, building automation, supporting deployments, and contributing to capacity planning, disaster recovery, and operational readiness.

The engineer also works on a number of region/realm rollouts and deployments, forecasts demand, and responds to capacity needs. They collaborate with software development teams to build reliable and scalable infrastructure, collect data to maintain and optimize operations and reliability, and apply their knowledge to incident response and/or maintenance tasks. They provide health and performance reporting, identify opportunities for automation, communicate about services, and identify and explain the potential impact of changes. They also provide technology support, document incidents, experiment with new tools and assess their potential impact, and keep up with site reliability trends.

Requirements

Experience working at the Site Reliability Engineer 3 level
Experience supporting mission-critical cloud services and production operations
Experience with improving service reliability
Experience with reducing operational risk
Experience with automating repetitive tasks
Experience driving faster detection and resolution of issues
Ability to work closely with development, infrastructure, security, and operations teams
Experience monitoring service health
Experience troubleshooting production issues
Experience participating in incident response
Experience improving observability
Experience implementing reliability best practices
Experience analyzing recurring failures
Experience building automation
Experience supporting deployments
Experience contributing to capacity planning
Experience contributing to disaster recovery
Experience contributing to operational readiness
Experience with region/realm rollouts and deployments
Experience forecasting demand and responding to capacity needs
Experience collaborating with software development teams to develop reliable and scalable infrastructure
Experience performing data collection to maintain and optimize operations and reliability
Experience performing incident response and/or maintenance tasks
Experience providing health and performance reporting
Experience identifying opportunities for automation
Experience communicating about services and identifying and explaining the potential impact of changes
Experience providing support for technology
Experience documenting incidents
Experience experimenting with new tools and assessing potential impact
Experience developing knowledge of site reliability trends

Responsibilities

Improving service reliability
Reducing operational risk
Automating repetitive tasks
Driving faster detection and resolution of issues
Monitoring service health
Troubleshooting production issues
Participating in incident response
Improving observability
Implementing reliability best practices
Analyzing recurring failures
Building automation
Supporting deployments
Contributing to capacity planning
Contributing to disaster recovery
Contributing to operational readiness
Working on region/realm rollouts and deployments
Forecasting demand and responding to capacity needs
Collaborating with software development teams to develop reliable and scalable infrastructure
Performing data collection to maintain and optimize operations and reliability
Performing incident response and/or maintenance tasks
Providing health and performance reporting
Identifying opportunities for automation
Communicating about services and identifying and explaining the potential impact of changes
Providing support for technology
Documenting incidents
Experimenting with new tools and assessing potential impact
Developing knowledge of site reliability trends