Principal Site Reliability Engineer - Public Safety

Oracle

About The Position

Oracle Public Safety is delivering a next‑generation SaaS platform that empowers First Responders with resilient, secure, and highly available software. Our mission is to ensure our services are reliable at scale, performant under real‑time workloads, and continuously improving, so First Responders can better serve our communities. We are growing rapidly and seeking an experienced Principal Site Reliability Engineer to help build, operate, and optimize our production platforms.

As a Principal SRE, you are a hands-on engineer who thrives at the intersection of software and systems. You define and uphold reliability standards, drive automation, and partner closely with product and engineering to design systems that are observable, scalable, secure, and cost‑efficient. You’re comfortable owning complex production environments, leading incident response, and turning learnings into lasting improvements. You communicate clearly, influence cross‑functional teams, and champion a culture of operational excellence.

Requirements

6–10 years of hands-on experience in Site Reliability Engineering, Production Engineering, or closely related software/systems roles.
Strong Linux/Unix fundamentals (Oracle Linux preferred) and systems performance tuning.
Proficiency operating services on OCI (preferred) or another major cloud; solid understanding of networking, VPCs, IAM, and security groups.
Containers and orchestration expertise: Docker and Kubernetes (including Helm, operators, and multi‑cluster strategies).
CI/CD experience (Jenkins or GitLab CI) with progressive delivery patterns, quality gates, and environment promotions.
Programming languages: Java is required, including debugging, performance tuning, and operating Java-based microservices in production; scripting and automation in Bash.
Infrastructure as Code and automation: Terraform, Ansible.
Datastores: Oracle Database, MySQL; familiarity with MS SQL and/or NoSQL is a plus; experience with performance, HA, and backup/restore.
Observability: hands-on with metrics/logs/traces (e.g., Prometheus, Grafana, OCI Monitoring/Logging, OpenTelemetry); alert design and runbooks.
Version control and collaboration: Git (Bitbucket preferred); issue tracking and documentation (Jira, Confluence).
Experience with ITIL practices (Incident, Problem, Change; Foundation certification preferred) and Agile delivery frameworks.
Familiarity with web and microservices architectures, REST/GraphQL, API gateways, and edge/CDN patterns.
A systems thinker with excellent communication skills; able to move from strategy to detailed implementation and influence across teams.
Self‑starter; comfortable owning complex production systems and driving cross‑functional reliability initiatives.

Nice To Haves

Experience with service mesh (e.g., Istio), policy as code (OPA), and secrets management (Vault/OCI Vault).
Chaos engineering, reliability testing frameworks, or game day facilitation.
Cost management/FinOps in multi‑tenant SaaS.
Experience supporting AI/ML or real‑time event/data processing platforms in production.