Site Reliability Developer 3

Oracle•Reston, VA

59d

About The Position

At Oracle Cloud Infrastructure (OCI), we build the future of the cloud for Enterprises as a diverse team of fellow creators and inventors. We act with the speed and attitude of a start-up, with the scale and customer focus of the leading enterprise software company in the world. Values are OCI’s foundation and how we deliver excellence. We strive for equity, inclusion, and respect for all. We are committed to the greater good in our products and our actions. We are constantly learning and taking opportunities to grow our careers and ourselves. We challenge each other to stretch beyond our past to build our future.

You are the builder here. You will be part of a team of really smart, motivated, and diverse people and given the autonomy and support to do your best work. It is a dynamic and flexible workplace where you’ll belong and be encouraged.

Site Reliability Developer
Oracle Cloud Infrastructure (OCI) - OCI National Security Regions
Reston, VA / Seattle, WA / Austin, TX
https://www.oracle.com/cloud/

The OCI National Security Region Networking team is looking for a Senior Site Reliability Engineer. As a Site Reliability Engineer, you will solve interesting technical challenges by defining, designing, deploying, and troubleshooting key Network Automation services, focusing on scalability, security, and performance. The role involves software engineering, systems engineering, automation, network operations, and DevOps. You should be comfortable building complex distributed systems. You will incorporate the ethos of software engineering and apply it to large-scale operational problems. Your primary goals are to create highly reliable services, platforms, and infrastructure, always thinking about reliability, security, and ultra-scalable software systems to manage operations.
When not working on operations, you will work on software engineering tasks such as designing and developing systems that increase reliability and scalability and reduce operational overhead through automation. You should value simplicity and scale, work comfortably in a collaborative, agile environment, and be excited to learn.

A great software engineer will make all the difference in delivering quality solutions to our customers. Are you passionate about designing, developing, testing, and delivering cloud services? Do you thrive in a fast-paced environment and want to be an integral part of a truly great team? Come join us!

As a Senior Site Reliability Engineer, you will be responsible for the areas detailed under Responsibilities below: System Design and Operation, Automation and Infrastructure as Code, Observability and Incident Response, and Collaboration and Standards.

Requirements

US Government TS/SCI with Polygraph
U.S. Citizenship (Federal Government customer)
Bachelor’s or Master’s degree in CS or a related engineering field
5+ years of experience in software development/IT operations
5+ years in SRE, Infrastructure, or Systems Engineering roles managing production services.
Deep expertise with Unix/Linux systems, particularly Oracle Linux.
Experience in kernel tuning, performance profiling, and debugging complex system issues.
Proficiency in Python and Bash scripting.
Strong grasp of Infrastructure as Code tools like Ansible and Terraform.
Experience running hybrid infrastructure (on-premises) with VMware, containers, and Kubernetes.
Hands-on experience with monitoring, telemetry, and observability stacks.
Excellent problem-solving skills; ability to multi-task and prioritize.
Ability to work independently; works well under pressure.
Strong communication and collaboration skills with the ability to engage and influence.
Self-motivated; able and willing to help where help is needed.
Able to build and maintain relationships, be culturally sensitive, align on goals, and learn quickly.
Willing and able to work with geographically distributed teams.

Nice To Haves

Experience with virtualization and container technologies (e.g., Docker, Kubernetes).
Experience with continuous integration platforms such as Jenkins.
Experience with monitoring and alerting technologies (e.g., Prometheus, Grafana).
Experience with PostgreSQL; understanding of replication, failover, backups.
Experience with Git.

Responsibilities

System Design and Operation:
Design and manage distributed Unix-based systems, particularly Oracle Linux.
Implement auto-scaling and self-healing infrastructure to ensure uptime and durability.
Tune system internals, including kernel parameters, networking, and filesystems, for high performance.
Maintain timely OS patching and compliance posture across environments.
Integrate systems with enterprise identity services such as Active Directory, LDAP, and Kerberos.
Automation and Infrastructure as Code:
Develop and maintain infrastructure automation using Ansible and Terraform.
Automate deployment pipelines, service configurations, and patch management.
Develop scripts and services in Python and Bash to enhance infrastructure delivery workflows.
Extend APIs and platform automation to drive efficiency and repeatability.
Observability and Incident Response:
Develop observability stacks using tools like Prometheus, Grafana, and other open-source telemetry tools.
Create dashboards and SLO/SLI-based alerts for real-time monitoring of production systems.
Participate in a global 24/7 on-call rotation, leading responses for high-severity incidents.
Conduct post-incident analysis (RCA) and drive remediations that improve long-term reliability.
Collaboration and Standards:
Partner with development teams to embed reliability in deployment pipelines.
Help define system architecture standards and maintain robust platform documentation.
Mentor engineers in Unix performance, observability, and debugging practices.
Champion a culture of automation, resilience, and continuous improvement.

Benefits

Medical, dental, and vision insurance, including expert medical opinion
Short-term and long-term disability
Life insurance and AD&D
Supplemental life insurance (Employee/Spouse/Child)
Health care and dependent care Flexible Spending Accounts
Pre-tax commuter and parking benefits
401(k) Savings and Investment Plan with company match
Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation.
11 paid holidays
Paid sick leave: 72 hours of paid sick leave upon date of hire. Refreshes each calendar year. Unused balance will carry over each year up to a maximum cap of 112 hours.
Paid parental leave
Adoption assistance
Employee Stock Purchase Plan
Financial planning and group legal
Voluntary benefits including auto, homeowner and pet insurance