Senior Staff Site Reliability Engineer - Cloud Platform Engineering

NVIDIA
Santa Clara, CA
$164,000 - $310,500

About The Position

As a Sr. Staff Engineer at NVIDIA, you will lead the design and development of scalable cloud platforms across multiple public cloud providers. This role involves operating existing infrastructure with a focus on reliability and security, collaborating with various engineering teams, and mentoring peers. You will be responsible for implementing cloud systems and architectures, automating processes, and ensuring high availability and performance of cloud services.

Requirements

  • Bachelor's and/or Master's degree in Computer Science or a related field of study, or equivalent experience.
  • 8+ years of experience in Software Development and/or Site Reliability Engineering/Production Engineering.
  • Strong software development experience using Python.
  • Experience with multiple cloud service providers (at least two): AWS, OCI, Azure and/or GCP.
  • Infrastructure as Code (IaC) automation experience using Terraform CDK or a similar technology.
  • Source code management experience with GitLab or GitHub.
  • Strong systems engineering background in Linux or Windows.
  • Proficiency in simple yet efficient systems design.
  • Strong understanding of network design and architecture.
  • Strong communication skills with the ability to understand and explain technical issues to a non-technical audience.

Nice To Haves

  • Experience deploying and operating Kubernetes clusters.
  • Scaling and managing distributed systems.
  • Significant experience with monitoring and observability platforms such as Datadog.
  • Comprehensive understanding of web application security.

Responsibilities

  • Lead Cloud Platform Engineering initiatives, including developing, designing, automating, improving and sustaining standard platform services (on-premises and cloud).
  • Develop and implement cloud systems and architectures on AWS, OCI, Azure, and GCP.
  • Design and implement monitoring and alerting strategies to ensure uptime and reduce mean time to detect (MTTD).
  • Automate manual processes to improve efficiency and reduce human error.
  • Practice Agile development methodologies with iterative releases of fully functional solutions, including remediation of existing tech debt.
  • Mentor and up-skill engineering peers and colleagues in the operational organization.
  • Collaborate with other engineering teams across NVIDIA to drive the execution of meaningful products for the business.
  • Help continuously set the standard for code quality and infrastructure design.
  • Participate in hiring across the organization.

Benefits

  • Highly competitive salaries
  • Comprehensive benefits package
  • Equity eligibility