About The Position

The Data Center Customer Engineering team is seeking experienced MLOps Engineers to lead the deployment and optimization of rack-scale deep learning workloads powered by Qualcomm Cloud AI inference accelerators. These accelerators leverage Qualcomm's expertise in hardware-accelerated AI to power high-performance, energy-efficient generative AI and computer vision inference workloads in the data center. In this role, you will collaborate closely with strategic partners and customers to drive seamless provisioning, orchestration, optimization, monitoring, and lifecycle management of end-to-end deep learning inference pipelines on Qualcomm's Cloud AI data center deployments. Ideal candidates will bring a strong foundation in ML model deployment, systems engineering, rack-scale management software, DevOps/MLOps automation, and cross-functional collaboration. The specific responsibilities and requirements for this role are detailed below.

Requirements

  • Understanding of AI/ML inference workloads
  • Strong problem-solving and diagnostic abilities
  • Ability to work in high-pressure environments and respond to incidents quickly
  • Strong attention to detail with a focus on quality and reliability
  • Good communication skills for coordinating with teams
  • Hands-on experience installing, troubleshooting, and maintaining servers, storage devices, networking equipment, and AI accelerators
  • Familiarity with Linux, bare-metal, and virtualization platforms
  • Familiarity with data center infrastructure
  • Experience with Data Center Infrastructure Management (DCIM) software and environmental monitoring systems
  • Commitment to ongoing learning and professional development
  • Knowledge of programming/scripting languages like Python or Bash
  • Familiarity with cloud platforms
  • Bachelor's degree in Engineering, Information Systems, Computer Science, or related field and 2+ years of Software Applications Engineering, Software Development experience, or related work experience.
  • OR Master's degree in Engineering, Information Systems, Computer Science, or related field and 1+ year of Software Applications Engineering, Software Development experience, or related work experience.
  • OR PhD in Engineering, Information Systems, Computer Science, or related field.
  • 1+ year of any combination of academic and/or work experience with a programming language such as C, C++, Java, Python, etc.
  • 1+ year of any combination of academic and/or work experience with debugging techniques.

Responsibilities

  • Ensure optimal performance, uptime and availability of Cloud AI data center deployments
  • Manage Qualcomm Cloud AI Accelerator hardware for AI/ML workloads
  • Commission and decommission equipment
  • Oversee physical infrastructure: servers, storage, networking, power, cooling
  • Deploy and maintain infrastructure-as-code tools
  • Monitor and manage incident response, troubleshooting, root cause analysis, and preventative measures
  • Oversee software updates and maintenance
  • Monitor usage trends and plan for infrastructure scaling
  • Manage relationships with hardware and software service vendors
  • Coordinate with internal teams: IT, engineering, security
  • Provide regular reports on uptime, incidents, capacity and performance metrics
  • Track KPIs and SLAs
  • Ensure redundancy and failover mechanisms are in place
  • Document and enforce standard operating procedures