Engineering Manager, Observability Platform

NVIDIA • Santa Clara, CA

About The Position

At NVIDIA, we pride ourselves on data-driven decision-making, and the data science platform team is at the heart of this initiative. NVIDIA runs some of the most demanding AI, data, and platform workloads on the planet, and none of it works without a reliable, high-scale observability foundation. We’re hiring an Engineering Manager to lead the team that builds and operates NVIDIA’s global observability platform: the system that carries every metric, log, trace, profile, and event our engineers rely on to understand and debug their services.

This isn’t a traditional people-manager role. You’ll stay close to the technology, guide architecture decisions, review designs and code, and help the team solve real distributed-systems challenges. You’ll work with engineers to shape how services instrument themselves, how we ingest and store high-cardinality telemetry, and how observability fits cleanly into NVIDIA’s broader platform ecosystem. You’ll partner directly with platform, infrastructure, and application teams to evolve how telemetry flows across metrics, logs, traces, profiling, and events.

You’ll coach and mentor engineers, build strong technical habits, and drive a roadmap that keeps the platform reliable and ready for NVIDIA’s rapid growth. If you enjoy deep technical work, high-throughput pipelines, open-source observability stacks, and helping engineers do the best work of their careers, this role is built for you.

Requirements

Bachelor’s or Master’s degree in Computer Science or a related technical field (or equivalent experience).
8+ years of overall experience building distributed systems, with a focus on observability and monitoring, and 3+ years managing or leading engineers.
Experience with modern observability stacks such as Prometheus, Thanos, Mimir, Loki, OpenSearch, Jaeger, Tempo, or OpenTelemetry, or equivalent experience.
Strong foundations in distributed systems concepts including replication, sharding, durability, consensus, and performance tuning.
Hands-on experience designing or scaling ingestion pipelines, time-series engines, trace backends, or log indexing systems, especially in high-cardinality environments.
Ability to read and review Go or Python code and support engineers through technical decision-making.
Clear architectural thinking with a focus on stable APIs, predictable performance, and long-term evolution.
Experience mentoring engineers, improving technical judgment, and contributing to a healthy and inclusive engineering culture.
Strong communication skills and the ability to explain complex challenges with clarity.

Nice To Haves

Experience building or contributing to an observability or telemetry platform used at significant scale.
Contributions to open-source projects such as OpenTelemetry, Prometheus, Loki, Thanos, Tempo, Jaeger, ClickHouse, Mimir, or Elasticsearch.
Experience with high-throughput systems such as Kafka, Flink, Spark, or large-scale data collectors.
Deep knowledge of cardinality management, query performance, storage design, or retention optimization.
Experience designing multi-region architectures with a focus on consistency, availability, and data locality.

Responsibilities

Leading a team of engineers who design and build the core services, pipelines, and storage layers behind NVIDIA’s observability platform.
Creating a clear technical direction for the team and supporting work that emphasizes simplicity, performance, and maintainability.
Defining the architecture for distributed ingestion services, time-series storage, log and trace pipelines, query paths, and multi-region data flows.
Partnering with platform, infrastructure, and application teams to define data models, instrumentation patterns, APIs, and integration standards.
Strengthening engineering practices through better tooling, automated tests, schema management, API versioning, documentation, and safe rollout processes.
Helping engineers solve distributed-systems issues including ingestion load, indexing pressure, compaction behavior, query fan-out, and replication patterns.
Driving predictable execution through clear priorities, collaborative planning, and strong alignment across teams.
Representing the observability platform across NVIDIA, gathering feedback, and evolving the system to support future AI workloads.