Sr Software Engineer - AI Infrastructure

Oracle, Santa Clara, CA
113d
$79,800 - $178,100

About The Position

Oracle Cloud Infrastructure (OCI) is looking for a Senior Software Engineer - AI Infrastructure to lead the development of scalable, resilient, and secure infrastructure systems that underpin the core of OCI's compute platform. This role sits within the Host Provisioning Services (HoPS) team, which owns the critical infrastructure responsible for automating the full server lifecycle from rack integration and hardware bring-up to customer-ready instance provisioning and firmware management.

HoPS services operate at the intersection of bare metal hardware and full-stack orchestration frameworks. They interface directly with components like BMCs, NICs, SmartNICs, ILOMs, GPUs, and custom firmware stacks. The team builds microservices and tooling that provision, configure, secure, and validate server platforms across OCI's global fleet.

As a Senior Software Engineer, you will design and deliver highly available services and automation pipelines that manage server provisioning at hyperscale, enable firmware pinning for deterministic customer environments, and deliver fleet-wide firmware updates and telemetry-based observability. You'll drive solutions to support new silicon (e.g., NVIDIA, AMD, Intel platforms), SmartNIC/HostNIC convergence, RoT security integration, and the evolution of OCI's infrastructure into next-gen clusters and composable hardware environments.

You will partner closely with teams across Compute, Networking, Security, Datacenter Engineering, and Hardware Development to ensure OCI can launch, scale, and maintain new server platforms with minimal operational overhead and high reliability. This role is ideal for experienced systems engineers with a deep understanding of operating systems, hardware-software integration, distributed services, and cloud-scale automation.

Requirements

  • Deep understanding of operating systems.
  • Experience in hardware-software integration.
  • Knowledge of distributed services.
  • Expertise in cloud-scale automation.

Responsibilities

  • Design, develop, and maintain highly available and scalable microservices for OCI's server provisioning and lifecycle management.
  • Lead automation of the full server lifecycle including rack integration, hardware bring-up, provisioning, and firmware management.
  • Build systems that interface directly with bare metal components such as BMCs, ILOMs, NICs, SmartNICs, and GPUs.
  • Develop automation pipelines for provisioning, firmware validation, and observability across OCI's global fleet.
  • Implement firmware pinning and update mechanisms to support deterministic and secure customer environments.
  • Deliver telemetry-backed monitoring and alerting systems to ensure infrastructure health and visibility.
  • Support onboarding of new hardware platforms, including custom silicon and next-gen server technologies (e.g., NVIDIA GB200, AMD, Intel).
  • Enable secure root-of-trust (RoT) integrations and SmartNIC/HostNIC convergence for next-generation platform reliability.
  • Collaborate with cross-functional teams across Compute, Networking, Security, Datacenter Engineering, and Hardware Development.
  • Contribute to the evolution of OCI infrastructure toward composable hardware and next-generation data center clusters.
  • Drive design reviews, participate in on-call rotations, and contribute to operational excellence and incident prevention.
  • Provide technical leadership in troubleshooting, root cause analysis, and continuous improvement of service reliability.

Benefits

  • Medical, dental, and vision insurance, including expert medical opinion.
  • Short-term and long-term disability.
  • Life insurance and AD&D.
  • Supplemental life insurance (Employee/Spouse/Child).
  • Health care and dependent care Flexible Spending Accounts.
  • Pre-tax commuter and parking benefits.
  • 401(k) Savings and Investment Plan with company match.
  • Flexible Vacation and paid time off.
  • 11 paid holidays.
  • Paid sick leave: 72 hours upon date of hire.
  • Paid parental leave.
  • Adoption assistance.
  • Employee Stock Purchase Plan.
  • Financial planning and group legal.
  • Voluntary benefits including auto, homeowner and pet insurance.