Senior DevOps Engineer

NVIDIA
Santa Clara, CA
$168,000 - $270,250

About The Position

NVIDIA is seeking a passionate, motivated, and technical Architect/Engineer to join its dynamic and fast-paced Infrastructure, Planning and Processes organization. You will work as a Principal DevOps & SRE Engineer, supporting the design and implementation of AI tool solutions on Kubernetes for the company's Cloud Platform. The position is part of a fast-paced crew that develops and maintains sophisticated build and test environments for a multitude of hardware platforms, including both NVIDIA GPUs and Tegra processors, along with various operating systems (Windows/Linux/Android). The team works with other business units within NVIDIA Software, such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, Robotics, and Autonomous Cars, to cater to their infrastructure and system needs.

Requirements

  • Kubernetes domain expertise, with extensive experience building scalable, resilient platforms in both public and private cloud and the ability to provide platform engineering and architecture standard methodologies.
  • Experience maintaining cloud infrastructure (on-prem and CSP) and highly available production environments.
  • Strong programming background in Python and/or similar scripting languages.
  • Excellent problem solving, communication, and teamwork skills.
  • Strong understanding of architectural requirements and development processes involved in building reliable, robust, scalable data products and pipelines.
  • Demonstrated ability to automate processes using Continuous Integration/Continuous Delivery (CI/CD) tools.
  • Proficient in using configuration-as-code and infrastructure-as-code tools such as Ansible, Puppet, Chef, and Terraform.
  • Strong background with GitLab, GitHub, Perforce, Jenkins, and/or other CI/CD systems, as well as Artifactory.
  • Experienced with data analytics/visualization & monitoring tools like Kibana, Grafana, Splunk, Zabbix, Prometheus and/or similar systems.
  • Experience with both SQL (MySQL) and NoSQL (Elasticsearch/MongoDB/Cassandra) databases.
  • 10+ years of proven experience and a Bachelor’s or Master’s degree in Computer Science, Software Engineering, or equivalent experience.

Nice To Haves

  • Solid understanding of containerization and microservices architecture.
  • Certified Kubernetes Administrator (CKA), Certified Kubernetes Security Specialist (CKS) & Certified Kubernetes Application Developer (CKAD) preferred.
  • Prior experience implementing and managing Trustworthy AI tools (QuantPi, Credo AI, Armilla AI), coding assistance AI tools (Cursor, Sourcegraph Cody), and code review AI tools (CodeRabbit).
  • Thrives in a multi-tasking environment with constantly evolving priorities.
  • Ability to break complex problems into simple subproblems and reuse available solutions to solve most of them.
  • Ability to design simple systems that can work efficiently without needing much support.
  • Prior experience with a large-scale operations team.
  • Experience with using and improving data centers.
  • Background in computer algorithms and the ability to choose the most suitable algorithms to meet scaling challenges.

Responsibilities

  • Craft the overall architecture for integrating coding assistance & Trustworthy AI tools into the existing infrastructure, ensuring alignment with reliable, scalable, and secure standard methodologies.
  • Design for scalability, ensuring the implementation can support current and future workloads without degrading system performance.
  • Identify and automate repetitive or toilsome production tasks related to code deployment, validation, and review, leveraging coding assistance tools to improve operational efficiency.
  • Implement robust monitoring and observability for coding assistance/Trustworthy AI tools & application services, ensuring their availability and performance within the production environment.
  • Integrate security best practices throughout the development lifecycle, ensuring coding assistance tools do not introduce vulnerabilities or compliance risks.
  • Collaborate closely with software engineers, product teams, and security teams to align the capabilities of coding assistance/Trustworthy AI tools with organizational goals and developer needs.
  • Establish feedback mechanisms to gather insights from developers and product/engineering teams on the effectiveness of coding assistance/Trustworthy AI tools, iterating on integrations and configurations for continuous improvement.
  • Maintain comprehensive documentation for architecture decisions, integration processes, operational runbooks, and troubleshooting guides.

Benefits

  • Competitive salaries
  • Generous benefits package
  • Equity eligibility