Cloud Support Engineer - Managed Cloud Services

Cadence Design Systems • San Jose, CA

Onsite

About The Position

We are seeking a highly motivated Cloud Support Engineer with a strong infrastructure background to support our secure, cloud‑based silicon chip design environments, which external customers use for mission‑critical EDA, HPC, and containerized workloads. This role is customer‑facing and service‑oriented, requiring deep technical expertise across Linux, cloud infrastructure, and platform operations, along with a strong commitment to responsiveness, professionalism, and an exceptional customer experience. It is well‑suited for engineers with hands‑on experience operating OpenStack and/or OpenShift platforms alongside traditional infrastructure components such as compute, storage, networking, and identity services. Success is measured not only by technical outcomes but also by customer satisfaction, trust, and confidence in the service. This position involves working with export‑restricted data (ITAR/CUI) and supporting highly secure environments with stringent operational and compliance standards.

Requirements

Strong hands‑on experience with Linux system administration and troubleshooting
Broad infrastructure experience, including compute, storage, networking, and identity services
Experience operating and supporting OpenStack and/or OpenShift (Kubernetes) environments
Experience supporting HPC or large‑scale compute environments
Proficiency in Python, shell scripting, Perl, or similar automation‑focused languages
Experience with monitoring, logging, and alerting platforms
Familiarity with license management systems (e.g., FlexNet / FLEXlm or equivalent)
Demonstrated ability to deliver excellent customer service in a technical support, SRE, or infrastructure operations role
Strong sense of ownership and urgency when addressing customer‑impacting issues
Ability to balance deep technical problem‑solving with clear, customer‑friendly communication
Highly organized and able to manage multiple concurrent customer issues
Ability to work with export‑restricted data (ITAR/CUI)
U.S. Person status or eligibility as required to support export‑controlled environments

Nice To Haves

Experience supporting EDA, semiconductor, or silicon design environments
Experience with cloud‑based or on‑prem HPC platforms (private, public, or hybrid)
Strong background in infrastructure operations, SRE, or platform engineering roles
Experience with configuration management or infrastructure‑as‑code tools (e.g., Ansible, Terraform)
Experience applying AI‑assisted automation in production operations or support contexts

Responsibilities

Serve as a primary technical support contact for external customers using secure cloud‑based silicon design and HPC platforms
Deliver timely, responsive, and high‑quality support, ensuring customer issues are acknowledged, communicated, and resolved effectively
Proactively minimize downtime, anticipate customer needs, and resolve issues before they impact workloads
Clearly communicate complex technical issues, status updates, and resolutions to customers with varying levels of expertise
Build long‑term customer trust through professionalism, ownership, and consistent follow‑through
Support and troubleshoot Linux‑based infrastructure and cloud environments, including compute, storage, networking, and identity components
Operate and support OpenStack‑based private or hybrid cloud platforms, including core services (Nova, Neutron, Cinder, Glance, Keystone, etc.)
Support OpenShift / Kubernetes platforms, including cluster operations, workload troubleshooting, networking, storage integration, and upgrades
Maintain availability, performance, and reliability of secure multi‑tenant environments
Perform system‑level diagnosis across infrastructure layers to identify root cause and remediation paths
Partner with internal platform and engineering teams to drive stability and performance improvements
Monitor HPC cluster performance, job scheduling, throughput, and queue health
Identify and resolve HPC job performance issues, including scheduler configuration, resource contention, I/O bottlenecks, and memory constraints
Troubleshoot and resolve license availability, utilization, and checkout issues impacting customer workloads
Support distributed resource managers such as Slurm, LSF, SGE, or equivalent schedulers
Design, develop, and maintain automation for recurring operational tasks, including infrastructure and platform health monitoring; capacity tracking and alerting; user provisioning and de‑provisioning; license usage monitoring; and detection of abnormal system, container, or job behavior
Use Python, shell scripting, Perl, or similar tools to reduce manual effort and improve mean time to resolution (MTTR)
Apply AI‑assisted or agentic automation where appropriate to improve operational efficiency and customer experience
Operate and support systems containing ITAR‑controlled and CUI data in compliance with regulatory and corporate requirements
Follow documented security, access control, auditing, and change management procedures
Participate in incident response, post‑incident root cause analysis, and corrective action planning
Create and maintain runbooks, knowledge base articles, and customer‑facing documentation