Software Developer 4

Oracle, Nashville, TN

About The Position

Here at OCI, we’re building the world’s largest AI clusters, and we’re the fastest at bringing them to customers. The Strategic Customers Engineering (SCE) team at OCI manages relationships with some of our most significant AI infrastructure customers, who lead innovation in AI/ML applications and are key drivers of our revenue. We are looking for a highly skilled GPU systems engineer to validate GPU performance and scalability on customer-representative systems hosted within OCI. You will work closely with OCI GPU teams and partners, as well as internal hardware and software development teams, to drive customer GPU deliveries and enhance our AI infrastructure so that it delivers an exceptional customer experience and peak performance. You will also collaborate with and support internal and external stakeholders in diagnosing performance and benchmark-related issues.

Requirements

  • BS or MS in Computer Engineering, Computer Science, or a related field, with 6+ years of experience in the cloud infrastructure space
  • Solid understanding of cloud services, especially compute, network, and storage, as well as GPU architecture fundamentals
  • Experience with multi-GPU and distributed systems.
  • Hands-on experience with market-leading GPUs or AI platforms spanning development, bring-up, test, and characterization
  • Hands-on experience running and analyzing GPU benchmarks
  • Proficiency in Python, Bash, or similar scripting languages
  • Experience with modern server platforms across x86 and ARM architectures
  • Experience scripting and customizing diagnostics, validation, and test workflows
  • Experience with GPU supplier test code and open-source AI test and characterization tools
  • Experience with system integration, validation, and performance characterization
  • Demonstrated ability to debug and root-cause complex hardware and software issues
  • Proven ability to provide cross-functional technical leadership and collaborate effectively with internal teams and external partners
  • Experience in scripting and automation using tools like Ansible, Terraform, and/or Kubernetes
  • Strong communication and collaboration skills, with the ability to work effectively in cross-functional teams and convey technical concepts to non-technical stakeholders
  • Strong Linux skills with hands-on experience in Oracle Linux/RHEL/CentOS, Ubuntu, and Debian distributions, including system administration, package management, shell scripting, and performance optimization

Responsibilities

  • Perform performance characterization on multi-GPU and multi-node systems
  • Validate NCCL scalability across multiple GPUs, nodes, and clusters
  • Ensure benchmark results are correct, repeatable, and statistically valid
  • Validate system configurations including GPU topology, PCIe, NVLink, NVSwitch, and network fabrics
  • Compare measured NCCL performance against expected bandwidth and latency models
  • Ensure GPU benchmarks are correctly validated against CPV
  • Identify performance regressions across driver, firmware, CUDA, and NCCL releases
  • Debug NCCL performance issues related to GPU topology and affinity; network interconnects (InfiniBand, RoCE); and CUDA, drivers, and system software
  • Use NVIDIA profiling and debugging tools (Nsight Systems, Nsight Compute)
  • Assist customers with benchmark setup, configuration, and best practices
  • Provide actionable performance insights and recommendations
  • Support and guide customers on system integration, performance testing, and characterization
  • Provide technical support for internal teams and external customers on benchmark and performance issues
  • Collaborate and troubleshoot with service teams on architecture, driver, CUDA, NCCL, and networking related issues
  • Reproduce and debug customer-reported performance problems
  • Communicate findings clearly through reports, documentation, and presentations
  • Support capacity program delivery and technical engagement & planning
  • Assist OCI service teams and partner teams such as NVIDIA in root-causing potential hardware or software bugs
  • Be the voice of customers to OCI’s various cloud engineering teams

Benefits

  • Medical, dental, and vision insurance, including expert medical opinion
  • Short-term disability and long-term disability
  • Life insurance and AD&D
  • Supplemental life insurance (Employee/Spouse/Child)
  • Health care and dependent care Flexible Spending Accounts
  • Pre-tax commuter and parking benefits
  • 401(k) Savings and Investment Plan with company match
  • Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation.
  • 11 paid holidays
  • Paid sick leave: 72 hours of paid sick leave upon date of hire. Refreshes each calendar year. Unused balance will carry over each year up to a maximum cap of 112 hours.
  • Paid parental leave
  • Adoption assistance
  • Employee Stock Purchase Plan
  • Financial planning and group legal
  • Voluntary benefits including auto, homeowner and pet insurance