About The Position

NVIDIA’s Hardware Infrastructure organization is seeking a Senior Data Engineer to build and evolve analytics-ready data platforms that power observability, reliability analysis, and capacity forecasting for EDA datacenters. In this role, you will focus on transforming large-scale observability and telemetry data into trusted, well-modeled datasets that enable data scientists, analysts, and engineers to drive insights across global CPU and GPU compute clusters. We work closely with observability, infrastructure, and data science teams to ensure that data from EDA workloads and datacenter hardware is high quality, accessible, and optimized for analytical and predictive use cases.

Requirements

  • MS (preferred) or BS in Computer Science or a related field (or equivalent experience), with at least 10 years of experience designing, building, and operating large-scale data pipelines and data platforms for distributed systems or infrastructure data
  • Proficiency in Python and SQL, with experience supporting analytical and exploratory workloads
  • Hands-on experience with distributed data processing frameworks such as Spark or similar technologies
  • Familiarity with observability and telemetry data, including metrics, logs, traces, and time-series data
  • Experience designing data models and schemas that support flexible analysis and forecasting
  • Ability to take ownership of data engineering initiatives and drive them end-to-end in collaboration with multi-functional partners
  • Experience implementing data quality, validation, and monitoring for analytics pipelines
  • Strong communication and collaboration skills, particularly when working with engineering and infrastructure teams
  • Adaptability in fast-paced environments with evolving analytical and operational needs

Nice To Haves

  • Experience supporting datacenter infrastructure analytics, hardware reliability programs, or workload performance analysis
  • Familiarity with EDA workflows, HPC environments, or GPU-accelerated compute platforms
  • Experience integrating or operating observability stacks (Prometheus, Grafana, Elastic/OpenSearch, Kafka, Spark, or similar tools)
  • Background in large-scale distributed systems or data platforms
  • A track record of improving analytics velocity and reliability through better data foundations

Responsibilities

  • Design, build, and maintain analytics-focused data pipelines that ingest, transform, and curate observability data from EDA datacenters
  • Develop reliable ingestion pipelines for metrics, logs, traces, and hardware health telemetry generated by large-scale CPU and GPU clusters
  • Partner with observability engineers to integrate data from tools such as Prometheus, Grafana, Elastic/OpenSearch, and Spark-based platforms into unified analytical datasets
  • Model and organize data to support exploratory analysis, reliability modeling, forecasting, and long-term trend analysis
  • Build and optimize batch and streaming workflows that support both near-real-time analytics and historical analysis
  • Implement data quality checks, validation frameworks, and monitoring to ensure analytical accuracy and consistency
  • Define data retention, aggregation, and enrichment strategies that balance analysis needs, system performance, and storage costs
  • Enable self-service analytics by improving data discoverability, documentation, and usability
  • Collaborate with data scientists and analysts to understand analytical requirements and evolve datasets to support new models and insights
  • Continuously improve pipeline scalability, reliability, and performance as the datacenter footprint and workload complexity grow