Senior Service Reliability Operations Administrator

Nvidia | Santa Clara, CA
108d | $124,000 - $195,500 | Remote

About The Position

NVIDIA's NGC team is looking for highly motivated System Administrator/DevOps engineers to design, develop, and implement a global, dynamic, innovative Service Reliability Operations Center that provides extraordinary levels of support for our Cloud products and services. As a key member of the CIS (Compute Infrastructure Support) team, you will partner with Site Reliability Engineering, the Security Operations Center, DevOps teams, and other partners to help our services approach 100% availability. On the rare occasion that an incident occurs, you will be our front line in reducing the frequency and duration of issues. Working in partnership with the development community, the CIS team will develop monitors, alarms, and alerts that make the service more reliable and improve our customer experience.

Requirements

  • 5+ years of experience administering open system servers in a Production environment.
  • 3+ years of experience in demanding Internet, Cloud, or Telecommunications environments in a Systems Administration, DevOps, SRE, or NOC role.
  • B.S. in relevant disciplines or equivalent experience.
  • Expertise using monitoring tools and problem ticketing systems.
  • Strong problem-solving, analytical, and troubleshooting abilities.
  • Strong server administration experience.
  • Knowledge of shell scripting, automation, DNS, DHCP, storage concepts, basic networking, iptables, etc.
  • RHCE or equivalent level of knowledge.
  • Experience scripting in Python preferred, but not required.
  • Prior experience running virtual machines under open source or commercial hypervisors.
  • Experience operating services running on public or private clouds.
  • Knowledge and understanding of application containers and container orchestration systems.
  • Basic understanding of Git.
  • Experience performing system administration tasks using Ansible.
  • Prior experience analyzing system and network performance using monitoring alerts, data, and graphs.

Nice To Haves

  • Experience with predictive support or diagnostic routines.

Responsibilities

  • Provide 24/7 service in a follow-the-sun environment spanning continents.
  • Report directly to a manager in the United States.
  • Work shifts that may include a Saturday or a Sunday each week.
  • Use alerts and alarms to help prevent issues and incidents.
  • Work with the developer community to develop and implement predictive support or diagnostic routines.
  • Perform systems administration tasks, network administration tasks, and security incident monitoring.
  • Translate understanding of services into runbooks for the team.
  • Update and evolve runbooks as new features and functionality are added.
  • Initiate the incident management procedure when incidents and issues are discovered.
  • Engage with subject matter authorities or service owners as needed to resolve issues.
  • Provide extraordinary service levels for customers.

Benefits

  • Equity and benefits eligibility.

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Industry: Computer and Electronic Product Manufacturing
  • Education Level: Bachelor's degree
