Staff Engineer, Engineering Compute Infrastructure and Grid Operations

Marvell Technology • Westborough, MA

59d • $128,000 - $189,370

About The Position

We are seeking a Staff Engineer to design, operate, and continuously improve the engineering compute infrastructure used for large-scale chip design and verification. This role is heavily focused on grid job management, storage systems, reliability, and operational excellence in high-throughput compute environments. The ideal candidate has strong IT and systems skills, deep experience with batch schedulers and distributed storage, and a passion for diagnosing and preventing large-scale job failures that impact engineering productivity.

Requirements

Bachelor’s degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent experience.
8+ years of experience in compute infrastructure, grid operations, or large-scale engineering environments.
Strong experience with grid or batch schedulers (e.g., LSF, UGE, Slurm, PBS).
Hands-on experience debugging distributed systems and batch job failures.
Strong Linux systems knowledge, including process management and resource monitoring.
Experience with shared storage systems (NFS, enterprise filers, high-performance filesystems).
Strong scripting skills in Python, shell, or similar languages.

Nice To Haves

Experience supporting EDA or engineering compute workloads.
Familiarity with job controller or wrapper-based execution architectures.
Experience operating environments with thousands of concurrent batch jobs.
Exposure to cloud or hybrid compute environments.
Prior involvement in grid or filesystem migrations.
Strong incident response and post-mortem leadership skills.

Responsibilities

Own and evolve grid job management infrastructure used for large regressions and high-volume batch workloads.
Debug and resolve grid job failures, including scheduling issues, hung jobs, resource starvation, and intermittent infrastructure faults.
Improve job reliability through watchdogs, retries, heartbeats, timeouts, and failure detection mechanisms.
Work with job controllers and wrapper layers to ensure consistent behavior across grid environments (e.g., LSF, UGE).
Partner with IT and compute teams during grid migrations, upgrades, and expansions.
Develop deep operational understanding of shared engineering storage systems used by compute jobs.
Diagnose and resolve issues related to I/O performance, file contention, permissions, and cross-mounted filesystems.
Identify and mitigate storage-related failure modes that cause job instability or data corruption.
Collaborate with IT teams on filesystem migrations, maintenance windows, and outage prevention.
Proactively identify systemic issues that lead to grid instability or job loss.
Design and deploy monitoring, logging, and metrics to detect infrastructure problems early.
Perform root-cause analysis of complex, intermittent failures affecting compute, storage, or networking.
Define best practices and guardrails to prevent repeat incidents and improve overall system robustness.
Act as a technical bridge between engineering users, tools teams, and central IT.
Translate engineering workload requirements into actionable infrastructure improvements.
Communicate clearly during incidents, maintenance events, and post-mortems.
Document operational procedures and share knowledge to reduce support burden.