About The Position

The Oracle Cloud Infrastructure (OCI) Compute team delivers bare metal and virtual machines, including CPUs and GPUs, at scale. Given the rapid growth in machine learning, the performance and efficiency of these cloud services are critical. The Core Architecture team focuses on identifying and addressing performance and efficiency constraints throughout the entire lifecycle of compute services, from inventory management and capacity ingestion to placement, repair, and decommissioning.

Consulting engineers are tasked with performing in-depth analysis of business problems and then proposing and incubating new automated solutions to meet the demands of Oracle's largest customers. This role involves leading the architectural definition for new host lifecycle management capabilities that will power the next generation of the Compute Control Plane. This initiative spans various Compute domains, such as GPU validation and repairs, and requires driving engineers from these organizations to develop cohesive, microservice-based solutions to enable Compute to scale with growing customer demands.

The ideal candidate is a hands-on senior engineer with broad technical expertise, proven experience in solving cloud-scale problems, and extensive experience in distributed systems design and implementation to build fault-tolerant solutions that will form the foundation of future Compute offerings. Strong written and verbal communication skills, the ability to lead projects across organizational boundaries, and experience presenting work to senior leaders are essential.

Requirements

  • BS or MS degree in Computer Science/Engineering or a related IT field, or equivalent experience relevant to the functional area
  • 10+ years of development experience with large-scale, highly available distributed systems
  • Proficiency with Cloud-based Data Store primitives
  • Proficiency in Java programming patterns
  • Experience with operating distributed services at scale
  • Expertise in Linux and operating systems
  • Systematic problem-solving approach, strong communication skills, strong ownership and drive
  • Deep understanding of service metrics and alarms, gained through the development of dashboards, service KPIs, and alarming systems

Nice To Haves

  • Experience in management and automation of end-to-end CPU/GPU lifecycles at scale
  • Proficiency with Cloud and CI/CD environments
  • Proficiency with Terraform and Docker
  • Proficiency with modern build tools and pipelines
  • Proficiency building multi-tenant, virtualized infrastructure
  • Proficiency with change control management and mature operating processes
  • Proficiency with Security, including Identity, SSL, and certificates
  • Proficiency with Databases and Data Stores

Responsibilities

  • Perform deep analysis of business problems, then propose and incubate new automated solutions that address the needs of some of our largest customers
  • Take the lead in defining the architecture for the brand-new host lifecycle management capabilities that will power the next generation of the Compute Control Plane
  • Drive engineers from multiple Compute organizations (from GPU validation to repairs) to build cohesive, microservice-based solutions that will enable Compute to scale with growing customer demands
  • Propose, scope, design, and direct automation, optimizations, and enhancements
  • Mentor junior engineers

Benefits

  • Competitive benefits that support our people with flexible medical, life insurance, and retirement options
  • Employee volunteer programs

What This Job Offers

Job Type

Full-time

Career Level

Principal

Number of Employees

5,001-10,000 employees
