Oracle • Posted 6 days ago
Full-time • Mid Level
Seattle, WA
5,001-10,000 employees

The Oracle Cloud Infrastructure (OCI) Cluster Networking team is building an ultra-high-performance network to support AI/ML/HPC workloads. Join us to design systems that scale from tens of GPUs to hundreds of thousands of GPUs without sacrificing performance. Our team develops and tunes the software and hardware stack for distributed workloads using libraries such as NCCL on high-speed networks. Strong knowledge of and practical experience with NCCL are essential for this role. You’ll apply collective communication libraries to tune system performance at a previously unheard-of scale, using a cutting-edge approach to scaling. We’re looking for adaptable, self-motivated engineers who learn quickly, write solid code, and work across the stack. Ideal candidates have experience with distributed systems, value scalability and simplicity, and thrive in collaborative, agile environments.

  • 7+ years of experience with software (systems/application) development
  • 2+ years of experience with collective communication libraries such as NCCL, RCCL, or MPI, and with GPU frameworks such as CUDA and ROCm
  • 2+ years of experience with ML training frameworks such as PyTorch or TensorFlow
  • Proficient at programming in any two of C/C++, Python, Java, Scala, and Go
  • Proficient with data structures, algorithms, and operating systems
  • Excellent organizational, verbal, and written communication skills
  • Bachelor's degree in Computer Science and Engineering or a related engineering field
  • Master's or PhD degree in Computer Science or a related engineering field
  • Experience with RDMA programming, including but not limited to GPUDirect RDMA
  • Experience with distributed workload managers like Slurm or K8s
  • Experience with Linux performance tools
  • Experience with SDN, NFV, and cloud networking
  • Experience with Infrastructure-as-a-Service platforms such as OpenStack, AWS, GCP, or Azure
  • Medical, dental, and vision insurance, including expert medical opinion
  • Short-term and long-term disability
  • Life insurance and AD&D
  • Supplemental life insurance (Employee/Spouse/Child)
  • Health care and dependent care Flexible Spending Accounts
  • Pre-tax commuter and parking benefits
  • 401(k) Savings and Investment Plan with company match
  • Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation.
  • 11 paid holidays
  • Paid sick leave: 72 hours of paid sick leave upon date of hire. Refreshes each calendar year. Unused balance will carry over each year up to a maximum cap of 112 hours.
  • Paid parental leave
  • Adoption assistance
  • Employee Stock Purchase Plan
  • Financial planning and group legal
  • Voluntary benefits including auto, homeowner and pet insurance