Staff Engineer, Engineering Compute Infrastructure and Grid Operations

Marvell Technology
Westborough, MA
$128,000 - $189,370

About The Position

We are seeking a Senior Engineer to design, operate, and continuously improve the engineering compute infrastructure used for large-scale chip design and verification. This role is heavily focused on grid job management, storage systems, reliability, and operational excellence in high-throughput compute environments. The ideal candidate has strong IT and systems skills, deep experience with batch schedulers and distributed storage, and a passion for diagnosing and preventing large-scale job failures that impact engineering productivity.

Requirements

  • Bachelor’s degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent experience.
  • 8+ years of experience in compute infrastructure, grid operations, or large-scale engineering environments.
  • Strong experience with grid or batch schedulers (e.g., LSF, UGE, Slurm, PBS).
  • Hands-on experience debugging distributed systems and batch job failures.
  • Strong Linux systems knowledge, including process management and resource monitoring.
  • Experience with shared storage systems (NFS, enterprise filers, high-performance filesystems).
  • Strong scripting skills in Python, shell, or similar languages.

Nice To Haves

  • Experience supporting EDA or engineering compute workloads.
  • Familiarity with job controller or wrapper-based execution architectures.
  • Experience operating environments with thousands of concurrent batch jobs.
  • Exposure to cloud or hybrid compute environments.
  • Prior involvement in grid or filesystem migrations.
  • Strong incident response and post-mortem leadership skills.

Responsibilities

  • Own and evolve grid job management infrastructure used for large regressions and high-volume batch workloads.
  • Debug and resolve grid job failures, including scheduling issues, hung jobs, resource starvation, and intermittent infrastructure faults.
  • Improve job reliability through watchdogs, retries, heartbeats, timeouts, and failure detection mechanisms.
  • Work with job controllers and wrapper layers to ensure consistent behavior across grid environments (e.g., LSF, UGE).
  • Partner with IT and compute teams during grid migrations, upgrades, and expansions.
  • Develop deep operational understanding of shared engineering storage systems used by compute jobs.
  • Diagnose and resolve issues related to I/O performance, file contention, permissions, and cross-mounted filesystems.
  • Identify and mitigate storage-related failure modes that cause job instability or data corruption.
  • Collaborate with IT teams on filesystem migrations, maintenance windows, and outage prevention.
  • Proactively identify systemic issues that lead to grid instability or job loss.
  • Design and deploy monitoring, logging, and metrics to detect infrastructure problems early.
  • Perform root-cause analysis of complex, intermittent failures affecting compute, storage, or networking.
  • Define best practices and guardrails to prevent repeat incidents and improve overall system robustness.
  • Act as a technical bridge between engineering users, tools teams, and central IT.
  • Translate engineering workload requirements into actionable infrastructure improvements.
  • Communicate clearly during incidents, maintenance events, and post-mortems.
  • Document operational procedures and share knowledge to reduce support burden.

Benefits

  • Employee stock purchase plan with a 2-year lookback
  • Family support programs
  • Robust mental health resources
  • Recognition and service awards

What This Job Offers

Job Type

Full-time

Career Level

Senior

Education Level

Associate degree

Number of Employees

1,001-5,000 employees
