Senior Site Reliability Engineer

Oracle, Pleasanton, CA

About The Position

We are looking for a Site Reliability Engineer 3 to support mission-critical cloud services and production operations. The role focuses on improving service reliability, reducing operational risk, automating repetitive tasks, and driving faster detection and resolution of issues. The engineer will work closely with development, infrastructure, security, and operations teams to monitor service health, troubleshoot production issues, participate in incident response, improve observability, and implement reliability best practices. The role also includes analyzing recurring failures, building automation, supporting deployments, and contributing to capacity planning, disaster recovery, and operational readiness.

The engineer also works on a number of region/realm rollouts and deployments, forecasts demand, and responds to capacity needs; collaborates with software development teams to develop reliable and scalable infrastructure; collects data to maintain and optimize operations and reliability; and applies that knowledge to incident response and/or maintenance tasks. Other duties include providing health and performance reporting, identifying opportunities for automation, communicating about services and explaining the potential impact of changes, providing technology support, documenting incidents, experimenting with new tools and assessing their potential impact, and keeping up with site reliability trends.

Requirements

  • Site Reliability Engineer 3 experience
  • Experience supporting mission-critical cloud services and production operations
  • Experience with improving service reliability
  • Experience with reducing operational risk
  • Experience with automating repetitive tasks
  • Experience driving faster detection and resolution of issues
  • Ability to work closely with development, infrastructure, security, and operations teams
  • Experience monitoring service health
  • Experience troubleshooting production issues
  • Experience participating in incident response
  • Experience improving observability
  • Experience implementing reliability best practices
  • Experience analyzing recurring failures
  • Experience building automation
  • Experience supporting deployments
  • Experience contributing to capacity planning
  • Experience contributing to disaster recovery
  • Experience contributing to operational readiness
  • Experience with region/realm rollouts and deployments
  • Experience forecasting demand and responding to capacity needs
  • Experience collaborating with software development teams to develop reliable and scalable infrastructure
  • Experience performing data collection to maintain and optimize operations and reliability
  • Experience performing incident response and/or maintenance tasks
  • Experience providing health and performance reporting
  • Experience identifying opportunities for automation
  • Experience communicating about services and identifying and explaining the potential impact of changes
  • Experience providing support for technology
  • Experience documenting incidents
  • Experience experimenting with new tools and assessing potential impact
  • Experience developing knowledge of site reliability trends

Responsibilities

  • Improving service reliability
  • Reducing operational risk
  • Automating repetitive tasks
  • Driving faster detection and resolution of issues
  • Monitoring service health
  • Troubleshooting production issues
  • Participating in incident response
  • Improving observability
  • Implementing reliability best practices
  • Analyzing recurring failures
  • Building automation
  • Supporting deployments
  • Contributing to capacity planning
  • Contributing to disaster recovery
  • Contributing to operational readiness
  • Working on region/realm rollouts and deployments
  • Forecasting demand and responding to capacity needs
  • Collaborating with software development teams to develop reliable and scalable infrastructure
  • Performing data collection to maintain and optimize operations and reliability
  • Performing incident response and/or maintenance tasks
  • Providing health and performance reporting
  • Identifying opportunities for automation
  • Communicating about services and identifying and explaining the potential impact of changes
  • Providing support for technology
  • Documenting incidents
  • Experimenting with new tools and assessing potential impact
  • Developing knowledge of site reliability trends

Benefits

  • Flexible medical
  • Life insurance
  • Retirement options
  • Volunteer programs