Performance Engineer 5/6

Netflix, Los Gatos, CA

About The Position

We are looking for a highly experienced Performance Engineer to join our team, focusing on the critical area of GPU infrastructure efficiency and the optimization of large-scale AI/ML workloads. This role is essential to managing our rapidly growing computational footprint, ensuring we deliver maximum performance while optimizing cost and resource utilization. You will be a trusted expert, working at the intersection of infrastructure, ML platforms, and core engineering to drive meaningful impact across the organization.

Requirements

  • 10+ years of experience in systems performance analysis and optimization with a focus on large-scale distributed systems.
  • Deep understanding of GPU architecture, kernels, and ML frameworks.
  • Experience building and using CPU and GPU profilers and other performance analysis tools.
  • Expertise in identifying and resolving performance bottlenecks within the AI/ML infrastructure and software stack.
  • Experience with container orchestration platforms such as Kubernetes.
  • Experience with performance analysis and optimization in a multi-tenant, cloud-native environment.
  • Strong programming skills in languages such as Python and Java.

Nice To Haves

  • Experience with large language model (LLM) serving and training optimization techniques.
  • Understanding of Linux internals such as resource scheduling, memory management, and I/O for GPU-intensive workloads.
  • Experience with the performance analysis of high-speed networking protocols and interconnect technologies, such as InfiniBand and NVLink.
  • Experience with capacity engineering and cost optimization in a major public cloud environment.
  • Proven track record of contributing to open-source performance tools or research in the field.

Responsibilities

  • Drive efficiency and performance optimization across our large-scale infrastructure.
  • Collaborate with ML Platform and Data Science teams to build and enhance comprehensive profiling, tracing, and observability capabilities for GPU workloads.
  • Analyze and resolve complex performance bottlenecks across the entire stack, including hardware, drivers, OS, Kubernetes/scheduling, networking, storage, and application code.
  • Evaluate and guide the adoption of new GPU architectures, interconnects, and cloud vendor services to maximize performance and cost efficiency within Netflix's AI/ML ecosystem.
  • Share knowledge by documenting best practices, contributing to Netflix Tech Blogs, and presenting at industry and vendor forums.

Benefits

  • Netflix provides comprehensive benefits including Health Plans, Mental Health support, a 401(k) Retirement Plan with employer match, Stock Option Program, Disability Programs, Health Savings and Flexible Spending Accounts, Family-forming benefits, and Life and Serious Injury Benefits.
  • We also offer paid leave of absence programs.
  • Full-time hourly employees accrue 35 days of paid time off annually, to be used for vacation, holidays, and sick leave.
  • Full-time salaried employees are immediately entitled to flexible time off.

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Education Level: No Education Listed
  • Number of Employees: 5,001-10,000
