About The Position

OCI is Oracle’s next-generation cloud platform, built for the most demanding enterprise workloads. We are focused on delivering high-performance computing, storage, networking, and platform services at global scale. The AI Platform, Services & Solutions organization within OCI is building a robust ecosystem to support the end-to-end lifecycle of AI and machine learning workloads. From GPU infrastructure and training pipelines to model serving and deployment tools, we empower teams across Oracle and our customers to build and deploy AI at scale.

We are looking for a Principal Software Engineer to join our growing team and help shape the future of AI infrastructure and services at Oracle. You will work on critical components of OCI’s AI platform, including high-scale GPU cluster management, self-service ML infrastructure, and model serving systems.

In this role, you will:

  • Work on critical AI infrastructure that powers Oracle’s GenAI and ML initiatives.
  • Contribute to high-impact projects with visibility across Oracle Cloud.
  • Collaborate with top engineers and researchers in a fast-paced, innovation-driven environment.
  • Grow your career in a supportive, mission-driven team building the future of enterprise AI.

Requirements

  • 8+ years of experience shipping scalable, cloud-native distributed systems
  • Experience building multi-tenant Kubernetes clusters with security isolation.
  • Experience building Kubernetes controllers, operators, and CRDs to automate lifecycle management of AI/ML workloads.
  • Ability to implement advanced inference optimizations: distributed and disaggregated inference serving, multi-node inference, and KV-cache reuse.
  • Ability to build intelligent request routing and adaptive scheduling to maximize GPU utilization.
  • Experience with inference solutions such as NVIDIA Dynamo, vLLM, and Ray Serve.
  • Experience with production operations and best practices for putting quality code in production and troubleshooting issues when they arise
  • Able to effectively communicate technical ideas verbally and in writing (technical proposals, design specs, architecture diagrams and presentations)
  • BS in Computer Science, or equivalent experience
  • Experience in Go, Java, Python.

Nice To Haves

  • MS in Computer Science
  • Experience building control plane/data plane solutions for cloud-native companies
  • Experience in diagnosing, troubleshooting and resolving performance issues in complex environments
  • Deep understanding of Unix-like operating systems
  • Production experience with cloud and ML technologies
  • Experience with generative AI, LLMs, and machine learning

Responsibilities

  • Build cloud services on top of modern Infrastructure as a Service (IaaS) building blocks at OCI
  • Design and build distributed, scalable, fault-tolerant software systems
  • Participate in the entire software lifecycle – development, testing, CI and production operations
  • Design and lead software projects without needing significant guidance, and mentor and coach junior engineers
  • Balance product feature development with production operational concerns such as writing runbooks, ops automation, structured logging, and instrumentation for metrics and events
  • Leverage internal tooling at OCI to develop, build, deploy and troubleshoot software
  • Participate in on-call for the service with the team

Benefits

  • Medical, dental, and vision insurance, including expert medical opinion
  • Short-term and long-term disability
  • Life insurance and AD&D
  • Supplemental life insurance (Employee/Spouse/Child)
  • Health care and dependent care Flexible Spending Accounts
  • Pre-tax commuter and parking benefits
  • 401(k) Savings and Investment Plan with company match
  • Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation.
  • 11 paid holidays
  • Paid sick leave: 72 hours of paid sick leave upon date of hire, refreshed each calendar year. Unused balance carries over each year up to a maximum cap of 112 hours.
  • Paid parental leave
  • Adoption assistance
  • Employee Stock Purchase Plan
  • Financial planning and group legal
  • Voluntary benefits including auto, homeowner and pet insurance