Consulting Member of Technical Staff

Oracle • Seattle, WA


About The Position

The Oracle Cloud Infrastructure (OCI) Compute team delivers bare metal and virtual machines, including CPUs and GPUs, at scale. Given the rapid growth in machine learning, the performance and efficiency of these cloud services are critical. The Core Architecture team focuses on identifying and addressing performance and efficiency constraints throughout the entire lifecycle of compute services, from inventory management and capacity ingestion to placement, repair, and decommissioning.

Consulting engineers are tasked with performing in-depth analysis of business problems and then proposing and incubating new automated solutions to meet the demands of Oracle's largest customers. This role involves leading the architectural definition for new host lifecycle management capabilities that will power the next generation of the Compute Control Plane. This initiative spans various Compute domains, such as GPU validation and repairs, and requires driving engineers from these organizations to develop cohesive, microservice-based solutions to enable Compute to scale with growing customer demands.

The ideal candidate is a hands-on senior engineer with broad technical expertise, proven experience in solving cloud-scale problems, and extensive experience in distributed systems design and implementation to build fault-tolerant solutions that will form the foundation of future Compute offerings. Strong written and verbal communication skills, the ability to lead projects across organizational boundaries, and experience presenting work to senior leaders are essential.

Requirements

BS or MS degree in Computer Science/Engineering or a related IT field, or equivalent experience relevant to the functional area
10+ years of development experience with large-scale, highly available distributed systems
Proficiency with Cloud-based Data Store primitives
Proficiency in Java programming patterns
Experience with operating distributed services at scale
Expertise in Linux and operating systems
Systematic problem-solving approach, strong communication skills, strong ownership and drive
Deep understanding of service metrics and alarms, gained through the development of dashboards, service KPIs, and alarming systems

Nice To Haves

Experience in management and automation of end-to-end CPU/GPU lifecycles at scale
Proficiency with Cloud and CI/CD environments
Proficiency with Terraform and Docker
Proficiency with modern build tools and pipelines
Proficiency building multi-tenant, virtualized infrastructure
Proficiency with change control management and mature operating processes
Proficiency with Security, including Identity, SSL, and certificates
Proficiency with Databases and Data Stores

Responsibilities

Perform deep analysis of business problems, then propose and incubate new automated solutions that address the needs of some of our largest customers
Take the lead in defining the architecture for brand-new host lifecycle management capabilities that will power the next generation of the Compute Control Plane
Drive engineers from multiple Compute organizations (from GPU validation to repairs) to build cohesive, microservice-based solutions that will enable Compute to scale with growing customer demands
Propose, scope, design, and direct automation, optimizations, and enhancements
Mentor junior engineers