Site Reliability Engineer, HPC Infrastructure

Tesla, Palo Alto, CA
$164,480 - $246,720

About The Position

Tesla's Supercomputing/AI infrastructure team works directly with the high-performance computing and machine learning infrastructure on which our ML algorithms run; this includes virtual simulations and Autopilot hardware and silicon design. With the rapidly growing need for more data and optimized compute resources, cluster builds are getting larger and increasingly complex. Continued development and automation of deployment, monitoring, self-healing and alerting processes is imperative to the success of our engineering groups. As the scope and impact of our Optimus, Full Self-Driving (FSD) and Robotaxi efforts continue to scale, so does the value of this team and its work.

As a Site Reliability Engineer, you will be responsible for maintaining and improving our platform to ensure our FSD and Optimus engineering teams have the necessary tools and resources to be productive. This includes managing and operating our AI infrastructure, monitoring compute, GPU and network metrics, Linux troubleshooting and performance tuning, and security. Your work will directly facilitate neural network training at scale and streamline FSD development.

Requirements

  • Proficiency in Python, Golang and/or Bash
  • Proficiency with Linux fundamentals and performance optimizations
  • Experience with configuration management software (Ansible, etc.) and systems monitoring and alerting tools (Prometheus, Grafana, Telegraf, Splunk, etc.)
  • Experience with containerization and orchestration technologies such as Kubernetes
  • Bachelor's Degree in Computer Science, Computer Engineering, Electrical Engineering, Physics or proof of exceptional skills in a related field
  • 3+ years of additional equivalent experience or evidence of exceptional ability related to the position

Nice To Haves

  • Experience with high-throughput low-latency networks, GPU-based computing systems, and/or high-performance storage systems
  • Experience with Slurm, LSF and storage management of parallel file systems

Responsibilities

  • Support the AI/ML cluster infrastructure on GPU platforms, focusing on systems automation, configuration management and deployment at scale
  • Improve our monitoring and self-healing pipelines, as well as our security posture
  • Optimize our server, storage and network performance
  • Develop new tools in Python, Golang or Bash/Shell
  • Use Infrastructure as Code best practices
  • Participate in a 24x7 on-call rotation

Benefits

Along with competitive pay, as a full-time Tesla employee, you are eligible for the following benefits from day 1 of hire:
  • Medical plans, including plan options with a $0 payroll deduction
  • Family-building, fertility, adoption and surrogacy benefits
  • Dental (including orthodontic coverage) and vision plans; both have options with a $0 paycheck contribution
  • Company-paid Health Savings Account (HSA) contribution when enrolled in the high-deductible medical plan with HSA
  • Healthcare and Dependent Care Flexible Spending Accounts (FSA)
  • 401(k) with employer match, Employee Stock Purchase Plans, and other financial benefits
  • Company-paid Basic Life and AD&D insurance
  • Short-term and long-term disability insurance (90-day waiting period)
  • Employee Assistance Program
  • Sick and vacation time (flex time for salaried positions, accrued hours for hourly positions) and paid holidays
  • Back-up childcare and parenting support resources
  • Voluntary benefits, including critical illness, hospital indemnity, accident insurance, theft and legal services, and pet insurance
  • Weight Loss and Tobacco Cessation Programs
  • Tesla Babies program
  • Commuter benefits
  • Employee discounts and perks program