About The Position

This role provides technical leadership for the core data platforms behind Oracle Health’s Data & Analytics Platform. As a Principal Site Reliability Engineer (SRE), you will own shared, mission-critical systems used by multiple products and teams. You will lead the design and operation of large-scale, stateful distributed platforms, including Hadoop ecosystem components (HDFS, YARN, HBase) deployed on Oracle Big Data Service (BDS), Kafka, and Storm. These multi-tenant platforms are deployed and operated through Ansible- and Terraform-based automation and require strong architectural ownership to manage scale, change, and broad blast radius.

Requirements

  • U.S. citizenship required
  • Eligibility for a federal security clearance

Responsibilities

  • Own the end-to-end reliability, scalability, and operability of shared data platforms
  • Define platform standards, architectural direction, and operational guardrails
  • Influence cross-team technical decisions and long-term platform strategy
  • Drive long-term platform evolution and influence reliability strategy across the data ecosystem
  • Lead platform architecture and design reviews
  • Clearly articulate system behavior, dependencies, and failure modes
  • Make principled trade-offs between reliability, performance, cost, and complexity
  • Provide guidance and guardrails that enable downstream teams to use platforms safely and effectively
  • Establish capacity models, scaling strategies, and operational best practices
  • Design platforms that behave predictably under load, failure, and change
  • Own platform lifecycle events: upgrades, expansions, decommissioning, and recovery
  • Operate and evolve stateful distributed systems where data placement, replication, and recovery are critical
  • Reason about failure modes such as backpressure, rebalancing, region movement, replication lag, and rolling upgrades
  • Operate and maintain Kerberized platforms, including authentication, authorization, and secure service-to-service communication
  • Treat security as a first-class architectural concern
  • Design and evolve an Ansible- and Terraform-driven automation framework
  • Treat automation as production software: versioned, reviewed, tested, and improved
  • Eliminate operational toil by encoding reliability and safety into the platform
  • Serve as the ultimate escalation point for complex or ambiguous incidents
  • Focus on eliminating entire classes of failure, not just resolving individual issues
  • Represent SRE and platform engineering in high-visibility and sensitive forums
  • Communicate clearly with engineering leadership and partner teams

Benefits

  • Flexible medical
  • Life insurance
  • Retirement options

What This Job Offers

  • Job Type: Full-time
  • Career Level: Principal
  • Education Level: None listed
  • Number of Employees: 5,001-10,000
