AI Senior Staff Systems Engineer

Cadence
San Jose, CA
85d
$136,500 - $253,500

About The Position

At Cadence, we hire and develop leaders and innovators who want to make an impact on the world of technology. We are seeking a highly skilled and experienced AI Systems Engineer to join our team. This is a hands-on, senior individual contributor role that will be pivotal in leading the development, operations, and support of our AI infrastructure. You will be responsible for the entire lifecycle of our AI systems, from architecting and building high-performance GPU clusters to deploying and optimizing our most advanced AI models and agentic services.

Requirements

  • 10+ years of experience in a senior technical role, with at least 5 years focused on building and operating high-performance computing or AI infrastructure.
  • Expert-level knowledge of NVIDIA GPU architecture and technologies like CUDA and cuDNN.
  • Proven experience with public cloud AI services, specifically managing access, usage, and billing for Azure OpenAI and Google Cloud Platform (GCP) services.
  • Extensive hands-on experience with Docker: image management, container orchestration, and troubleshooting.
  • Proficiency in scripting languages such as Python, Bash, or Perl.
  • Deep expertise in Linux system administration (RHEL preferred), including networking, storage, and performance tuning.
  • Familiarity with user authentication and integration using systems like LDAP or Active Directory.
  • Strong problem-solving and communication skills with the ability to work in a multi-platform, cross-functional, and geographically distributed team.

Nice To Haves

  • Understanding of AI job profiling and tuning (memory, GPU, I/O).
  • Experience administering LSF clusters in a production or research environment.
  • Familiarity with other job schedulers such as Slurm.
  • Experience with LSF Docker integration and job submission using container images.
  • Experience with macOS/Apple Silicon system administration tasks and troubleshooting.

Responsibilities

  • Lead the design and implementation of our next-generation AI infrastructure to support our Agentic AI initiatives.
  • Support and secure the use of public cloud AI services, including Azure OpenAI services and Google Cloud Platform (GCP) services like Gemini.
  • Take a leadership role in the configuration, installation, and optimization of GPU server clusters.
  • Architect and deploy a robust and scalable AI tech stack.
  • Lead the deployment, serving, and optimization of Large Language Models (LLMs).
  • Architect and build production-grade Agentic AI workflows and services.
  • Develop and maintain automation scripts using languages like Python, Bash, or Perl.
  • Act as the final escalation point for the most complex technical issues related to our AI infrastructure.
  • Develop and implement security best practices for our AI systems and data.

Benefits

  • Paid vacation and paid holidays
  • 401(k) plan with employer match
  • Employee stock purchase plan
  • A variety of medical, dental and vision plan options
  • Incentive compensation: bonus, equity, and benefits

What This Job Offers

Job Type

Full-time

Career Level

Senior

Industry

Ambulatory Health Care Services

Number of Employees

5,001-10,000 employees
