Site Reliability Developer 3

Oracle

About The Position

The Oracle Cloud Infrastructure (OCI) team offers you the opportunity to build and operate a suite of massive-scale, integrated cloud services in a broadly distributed, multi-tenant cloud environment. OCI is committed to providing the best in cloud products with integration of the latest AI technologies. The OCI Security Products team is focused on ensuring that our cloud infrastructure is the safest and most reliable environment for development we can provide to our customers. We keep up with the constantly evolving and challenging cyber-threat landscape by developing novel ML/AI-based solutions and products to prevent and mitigate cyber-attacks on OCI.

We are looking for a top-notch Site Reliability Engineer to ensure the reliability, scalability, and performance of our cloud-based big data platform. You will work at the intersection of software engineering and infrastructure, building automation, improving observability, and ensuring the resilience of distributed data systems. You will play a critical role in maintaining SLAs/SLOs for large-scale batch and streaming data pipelines.

Requirements

3+ years of experience in SRE, DevOps, or Infrastructure Engineering.
Strong experience operating distributed systems in cloud environments.
Hands-on experience supporting big data technologies such as Apache Spark and Kafka.
Experience with Kubernetes and containerized workloads.
Strong scripting/programming skills (Python, Go, or similar).
Experience with Infrastructure as Code (Terraform).
Deep understanding of Linux, networking, and distributed systems concepts.
Solid understanding of cloud networking architecture and setup.

Responsibilities

Design and maintain highly available, fault-tolerant infrastructure for distributed big data systems, with a focus on the Delta Lake and Oracle Autonomous Data Warehouse tech stacks.
Support and optimize large-scale data platforms (batch and streaming).
Define and manage SLIs, SLOs, and error budgets for data services.
Build automation for provisioning, scaling, failover, and recovery.
Improve reliability of data ingestion, transformation, and storage layers.
Lead incident response and conduct root cause analysis (RCA).
Implement observability best practices (metrics, logging, tracing).
Optimize cluster performance, cost efficiency, and capacity planning.
Partner with Data Engineering and Platform teams to improve reliability of pipelines.
Improve CI/CD processes for data platform deployments.