Sr. Staff Engineer

Marvell Technology • Santa Clara, CA

1d • $108,140 - $162,000

Requirements

Experience supporting infrastructure for LLM, Agentic AI, or AI/ML platforms, including GPU-based Linux environments and scalable compute infrastructure.
Hands-on experience managing HPC (High Performance Computing) and GRID environments across on-premises and cloud platforms.
Familiarity with distributed computing frameworks, GPU scheduling, cluster management, and workload orchestration technologies.
Strong expertise in Linux system administration with hands-on experience in RHEL, CentOS, and Oracle Linux environments.
Proficiency in infrastructure automation using Ansible, shell scripting, Python, and Infrastructure as Code methodologies.
Experience with Docker, Kubernetes, and cloud-native deployment models in enterprise environments.
Knowledge of monitoring, observability, and performance optimization tools for distributed systems, HPC clusters, and AI-driven workloads.
Experience supporting hybrid infrastructure environments across AWS, Azure, GCP, or private cloud platforms.
Bachelor’s degree in Computer Science, Information Technology, or a related discipline.
Proven experience as an L2 Linux Systems Administrator in enterprise environments.
Solid understanding of Linux OS installation, configuration, troubleshooting, patching, and performance tuning.
Hands-on experience supporting Linux migration and infrastructure modernization projects.
Strong analytical, troubleshooting, and problem-solving capabilities.
Excellent communication and collaboration skills with the ability to work effectively in team-oriented environments.

Nice To Haves

Familiarity with CI/CD pipelines and DevOps operational practices.

Responsibilities

Support and manage infrastructure for LLM and Agentic AI platforms, including GPU-enabled Linux environments, inference workloads, and scalable AI services.
Administer and support HPC (High Performance Computing) and GRID computing environments across both on-premises and cloud infrastructures.
Collaborate with AI/ML, DevOps, and platform engineering teams to deploy, optimize, and maintain infrastructure for autonomous AI agents, vector databases, distributed compute clusters, and intelligent automation frameworks.
Automate AI infrastructure provisioning, configuration management, and operational workflows using Ansible, shell scripting, and Infrastructure as Code practices.
Implement monitoring, observability, and performance optimization solutions for AI/ML workloads, HPC clusters, GRID environments, and Linux-based compute infrastructure.
Install, configure, and maintain Linux operating systems including RHEL, Oracle Linux (OEL), and CentOS across multiple hardware platforms.
Design and customize OS builds in alignment with business requirements and industry best practices.
Manage Linux patching activities to maintain system security, stability, and compliance.
Plan and execute routine maintenance activities to optimize system availability and performance.
Work closely with security teams to identify, remediate, and prevent system vulnerabilities.
Develop, maintain, and enhance Ansible playbooks to automate system administration and configuration management tasks.
Identify and implement automation opportunities to improve operational efficiency and reduce manual effort.
Lead and support Linux migration projects, ensuring a smooth transition of applications and services across environments.
Coordinate migration planning, execution, validation, and post-migration support activities.
Assist in deploying and managing containerized applications using Docker, Kubernetes, and related orchestration platforms.
Support hybrid cloud and on-premises infrastructure environments for compute-intensive and AI-driven workloads.
Support CI/CD pipelines for infrastructure and application deployments in hybrid or cloud-native environments.
Create, maintain, and update technical documentation for configurations, procedures, and operational processes.