Senior Staff Site Reliability Engineer - Cloud Platform Engineering

NVIDIA • Santa Clara, CA

385d•$164,000 - $310,500

About The Position

As a Sr. Staff Engineer at NVIDIA, you will lead the design and development of scalable cloud platforms across multiple public cloud providers. This role involves operating existing infrastructure with a focus on reliability and security, collaborating with various engineering teams, and mentoring peers. You will be responsible for implementing cloud systems and architectures, automating processes, and ensuring high availability and performance of cloud services.

Requirements

Bachelor's and/or Master's degree in Computer Science or a related field of study, or equivalent experience.
8+ years of experience in Software Development and/or Site Reliability Engineering/Production Engineering.
Strong software development experience using Python.
Experience with multiple cloud service providers (at least two): AWS, OCI, Azure and/or GCP.
Infrastructure as Code (IaC) automation experience using Terraform CDK or a similar technology.
Source code management experience with GitLab or GitHub.
Strong systems engineering background in Linux or Windows.
Proficiency in simple yet efficient systems design.
Strong understanding of network design and architecture.
Strong communication skills with the ability to understand and explain technical issues to a non-technical audience.

Nice To Haves

Experience deploying and operating Kubernetes clusters.
Experience scaling and managing distributed systems.
Significant experience with monitoring and observability platforms such as Datadog.
Comprehensive understanding of web application security.

Responsibilities

Lead Cloud Platform Engineering initiatives, including developing, designing, automating, improving and sustaining standard platform services (on-premises and cloud).
Develop and implement cloud systems and architectures on AWS, OCI, Azure, and GCP.
Design and implement monitoring and alerting strategies to ensure uptime and reduce mean time to detect (MTTD).
Automate manual processes to improve efficiency and reduce human error.
Practice Agile development methodologies with iterative releases of fully functional solutions, including remediation of existing tech debt.
Mentor and upskill engineering peers and colleagues in the operational organization.
Collaborate with other engineering teams across NVIDIA to drive the execution of meaningful products for the business.
Help continuously set the standard for code quality and infrastructure design.
Participate in hiring across the organization.

Benefits

Highly competitive salaries
Comprehensive benefits package
Equity eligibility

What This Job Offers

Job Type

Full-time

Career Level

Senior

Industry

Computer and Electronic Product Manufacturing

Education Level

Bachelor's degree
