Site Reliability Engineer

Sustainable Talent
Santa Clara, CA
Onsite

About The Position

Sustainable Talent is partnering with NVIDIA, a global leader in computer graphics, PC gaming, and accelerated computing. This full-time contract role is for a Site Reliability Engineer to support NVIDIA's IPP (Infrastructure, Planning and Process) team in Santa Clara, CA. IPP is a global organization within NVIDIA Software that collaborates with groups such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, and Driverless Cars to meet their infrastructure needs. The cloud services this group provides execute nearly half a million automated jobs daily on thousands of servers, improving the efficiency of thousands of NVIDIA software engineers worldwide. The cloud environment features a diverse mix of machines and devices running various operating systems (Windows, Linux, Android) on multiple hardware platforms, including NVIDIA GPUs and Tegra processors. The ideal candidate is passionate about infrastructure and eager to tackle complex issues, build next-generation cloud services, craft innovative solutions, and analyze data to identify and resolve problems.

Requirements

  • Bachelor's or Master's Degree in Computer Science or Software Engineering, or equivalent experience.
  • Experience working in large-scale enterprise production systems; 5+ years of professional experience required.
  • Ability to debug and analyze system issues and code in order to triage, root-cause, and resolve infrastructure issues. Work closely with the platform engineering team to understand hardware setups.
  • Familiarity with setup and maintenance of Linux and Windows hosts.
  • Scripting experience with Python or Go; Unix shell proficiency.
  • Experience with version control systems such as Perforce or Git.

Nice To Haves

  • Experience with virtualization and container technologies such as VMware, KVM, Hyper-V, Docker, and Kubernetes.
  • Background in automating bare-metal and VM provisioning.
  • Experience supporting GPUs, embedded device development, driver development, and CUDA/TensorRT applications.
  • Development experience in Chef, Ansible and infrastructure orchestration.

Responsibilities

  • Monitor the fleet and recover assets in our private cloud environment, which houses compute servers with NVIDIA GPUs.
  • Focus specifically on building and stabilizing our virtualization infrastructure (ESXi, KVM, and Hyper-V).
  • Deploy and maintain a large farm of machines using the latest configuration management and infrastructure automation tools (Chef, Ansible, Terraform).
  • Participate in on-call and rotational L1 support for round-the-clock monitoring and remediation of infrastructure issues (PagerDuty).
  • Analyze and debug operating system, networking, configuration, and performance problems.
  • Assist in the roll-out and deployment of infrastructure configurations to support the latest NVIDIA hardware and technologies.
  • Contribute to the development of monitoring systems that provide a fast, reliable, real-time pulse of the various infrastructure subsystems (Zabbix, BigPanda, Grafana).

Benefits

  • Full benefits
  • PTO
  • Amazing company culture