About The Position

NVIDIA’s Hardware Infrastructure organization seeks a high-caliber Software Engineer to join the Data & Observability Platform team. We provide the data backbone for NVIDIA’s GPU/CPU development and AI research teams. Our platform manages large-scale telemetry that supports observability, reliability analysis, and capacity forecasting for EDA datacenters. We are looking for a software engineer with excellent coding skills and full-stack experience. You should be proficient in software engineering principles and experienced with distributed systems and data-intensive infrastructure.

Requirements

  • BS or MS in Computer Science, Electrical Engineering, or a related technical field or equivalent experience.
  • 2+ years of experience writing production-grade code.
  • High proficiency in Python or Java/Scala, with an emphasis on writing testable, performant, and asynchronous code.
  • A deep understanding of distributed systems and Linux internals.
  • Experience building end-to-end features from the ground up, from backend API and schema development through to implementing functional frontends.
  • Proficiency in SQL and an understanding of data modeling, read/write tradeoffs, query execution patterns, and optimized data formats.
  • Comfort debugging production issues, analyzing distributed traces, and tracking down memory leaks in a high-scale environment.
  • Ability to break down ambiguous problems into simpler, executable components in a fast-paced environment.

Nice To Haves

  • Experience with frameworks and tools such as Spark, Kafka, and Trino.
  • Familiarity with streaming concepts such as partition strategies, offset management, and backpressure.
  • Hands-on experience with Kubernetes, service deployments, Infrastructure as Code (Terraform), and cloud administration.
  • Familiarity with Prometheus, Grafana, or OpenSearch, and the ability to build custom exporters or telemetry dashboards.
  • Experience with HPC environments, semiconductor build workflows, or handling large-scale hardware telemetry.

Responsibilities

  • Build and maintain high-throughput services that collect and process metrics, logs, and hardware telemetry from compute clusters.
  • Design and deploy internal web applications, tools, and APIs to configure data pipelines, visualize platform health, and integrate AI workflows.
  • Contribute to modernizing the Data Platform, ensuring it aligns with industry trends and supports our unique needs and use cases.
  • Manage the deployment and lifecycle of services on Kubernetes and in the cloud.
  • Partner with users to refactor data schemas, identify latency bottlenecks, and apply storage patterns to reduce operational costs.
  • Ensure high platform availability through detailed on-call practices and proactive monitoring of complex data dependencies.

Benefits

  • You will be eligible for equity and benefits.