Senior Software Engineer, Cloud Functions

Nvidia
Santa Clara, CA
106d
$184,000 - $287,500

About The Position

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. NVIDIA is looking for great people like you to help us accelerate the next wave of artificial intelligence.

The team delivers NVIDIA Mission Control Software that runs on superpods. The software we develop is shipped as an autonomous hardware recovery engine and is responsible for baseline validation tests, taking remedial actions (break/fix workflows), and periodic health checks for hardware components.

We are looking for a Senior Software Engineer with experience in building highly scalable and robust enterprise software to join us. We are building and improving a powerful platform that will automate the diagnosis and repair of a cluster of GPUs or CPUs across public clouds, private clouds, and virtual and physical hardware.

Requirements

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent experience).
  • Keen interest in driving Agentic AI projects.
  • 10 years of equivalent experience.
  • Demonstrated ability in building scalable and robust distributed systems.
  • Proven record of product rollouts and collaborating with early adopters.
  • Proficiency in programming in C/C++, Java, Rust or Go.
  • Technical stewardship of projects across the organization.

Nice To Haves

  • Deep understanding of multi-threading and distributed systems concepts.
  • Excellent track record of delivering projects.
  • Expertise in optimizing SQL queries.
  • Expert-level knowledge of Go/Rust programming.

Responsibilities

  • Designing and implementing scalable and reliable software components that enable the core platform to maintain an inventory of resources (including hosts, GPUs, and switches), to automate actions that diagnose failures, and to repair them.
  • Enabling Agentic AI within the core platform to create remedial workflows.
  • Influencing the product roadmap in collaboration with teams across various departments with the goal of reducing SRE toil and improving hardware utilization.
  • Collaborating with various organizations across NVIDIA to drive adoption of the platform in order to improve GPU utilization.
  • Defining and running benchmarks for various subsystems.
  • Leading and delivering high-impact projects with high quality, performance, and stability while keeping resource consumption as low as possible.
  • Developing a robust feedback control system that analyzes signals about system health and automatically runs commands to fix discovered issues.
  • Programming in modern languages like Go and Rust.

Benefits

  • Competitive salaries.
  • Generous benefits package.
  • Equity eligibility.

What This Job Offers

Job Type: Full-time
Career Level: Senior
Industry: Computer and Electronic Product Manufacturing
Education Level: Bachelor's degree
