Senior Data Engineer - EDA Datacenter Analytics and Observability

NVIDIA•Westford, MA

7d•Hybrid

About The Position

NVIDIA’s Hardware Infrastructure organization is seeking a Senior Data Engineer to build and evolve analytics-ready data platforms that power observability, reliability analysis, and capacity forecasting for EDA datacenters. In this role, you will focus on transforming large-scale observability and telemetry data into trusted, well-modeled datasets that enable data scientists, analysts, and engineers to drive insights across global CPU and GPU compute clusters. We work closely with observability, infrastructure, and data science teams to ensure that data from EDA workloads and datacenter hardware is high quality, accessible, and optimized for analytical and predictive use cases.

Requirements

MS (preferred) or BS in Computer Science or a related field (or equivalent experience), with 10+ years of experience designing, building, and operating large-scale data pipelines and data platforms for distributed systems or infrastructure data
Proficiency in Python and SQL, with experience supporting analytical and exploratory workloads
Hands-on experience with distributed data processing frameworks such as Spark or similar technologies
Familiarity with observability and telemetry data, including metrics, logs, traces, and time-series data
Experience designing data models and schemas that support flexible analysis and forecasting
Ability to take ownership of data engineering initiatives and drive them end-to-end in collaboration with multi-functional partners
Experience implementing data quality, validation, and monitoring for analytics pipelines
Strong communication and collaboration skills, particularly when working with engineering and infrastructure teams
Adaptability in fast-paced environments with evolving analytical and operational needs

Nice To Haves

Experience supporting datacenter infrastructure analytics, hardware reliability programs, or workload performance analysis
Familiarity with EDA workflows, HPC environments, or GPU-accelerated compute platforms
Experience integrating or operating observability stacks (Prometheus, Grafana, Elastic/OpenSearch, Kafka, Spark, or similar tools)
Background in large-scale distributed systems or data platforms
A track record of improving analytics velocity and reliability through better data foundations

Responsibilities

Design, build, and maintain analytics-focused data pipelines that ingest, transform, and curate observability data from EDA datacenters
Develop reliable ingestion pipelines for metrics, logs, traces, and hardware health telemetry generated by large-scale CPU and GPU clusters
Partner with observability engineers to integrate data from tools such as Prometheus, Grafana, Elastic/OpenSearch, and Spark-based platforms into unified analytical datasets
Model and organize data to support exploratory analysis, reliability modeling, forecasting, and long-term trend analysis
Build and optimize batch and streaming workflows that support both near-real-time analytics and historical analysis
Implement data quality checks, validation frameworks, and monitoring to ensure analytical accuracy and consistency
Define data retention, aggregation, and enrichment strategies that balance analysis needs, system performance, and storage costs
Enable self-service analytics by improving data discoverability, documentation, and usability
Collaborate with data scientists and analysts to understand analytical requirements and evolve datasets to support new models and insights
Continuously improve pipeline scalability, reliability, and performance as datacenter footprint and workload complexity grow