About The Position

The Data Center Customer Engineering team is seeking experienced MLOps Engineers to lead the deployment and optimization of rack-scale deep learning workloads powered by Qualcomm Cloud AI inference accelerators. These accelerators leverage Qualcomm's expertise in hardware-accelerated AI to power high-performance, energy-efficient generative AI and computer vision inference workloads in the data center. In this role, you will collaborate closely with strategic partners and customers to drive seamless provisioning, orchestration, optimization, monitoring, and lifecycle management of end-to-end deep learning inference pipelines on Qualcomm's Cloud AI data center deployments. Ideal candidates will bring a strong foundation in ML model deployment, systems engineering, rack-scale management software, DevOps/MLOps automation, and cross-functional collaboration. The specific responsibilities and requirements for this role are detailed below.

Requirements

  • Understanding of AI/ML inference workloads
  • Strong problem-solving and diagnostic abilities
  • Ability to work in high-pressure environments and respond to incidents quickly
  • Strong attention to detail with a focus on quality and reliability
  • Good communication skills for coordinating with teams
  • Hands-on experience installing, troubleshooting, and maintaining servers, storage devices, networking equipment, and AI accelerators
  • Familiarity with Linux, bare-metal, and virtualization platforms
  • Familiarity with data center infrastructure
  • Experience with Data Center Infrastructure Management (DCIM) software and environmental monitoring systems
  • Commitment to ongoing learning and professional development
  • Knowledge of programming/scripting languages like Python or Bash
  • Familiarity with cloud platforms
  • Bachelor's degree in Engineering, Information Systems, Computer Science, or related field and 2+ years of Software Applications Engineering, Software Development experience, or related work experience.
  • OR Master's degree in Engineering, Information Systems, Computer Science, or related field and 1+ year of Software Applications Engineering, Software Development experience, or related work experience.
  • OR PhD in Engineering, Information Systems, Computer Science, or related field.
  • 1+ year of any combination of academic and/or work experience with a programming language such as C, C++, Java, Python, etc.
  • 1+ year of any combination of academic and/or work experience with debugging techniques.

Responsibilities

  • Ensure optimal performance, uptime and availability of Cloud AI data center deployments
  • Manage Qualcomm Cloud AI Accelerator hardware for AI/ML workloads
  • Commission and decommission equipment
  • Oversee physical infrastructure: servers, storage, networking, power, cooling
  • Deploy and maintain infrastructure-as-code tools
  • Monitor and manage incident response, troubleshooting, root cause analysis, and preventative measures
  • Oversee software updates and maintenance
  • Monitor usage trends and plan for infrastructure scaling
  • Manage relationships with hardware and software service vendors
  • Coordinate with internal teams: IT, engineering, security
  • Provide regular reports on uptime, incidents, capacity and performance metrics
  • Track KPIs and SLAs
  • Ensure redundancy and failover mechanisms are in place
  • Document and enforce standard operating procedures