Senior Systems Software Engineer, AV Infrastructure - Validation and Distributed Systems

NVIDIA • US, CA

About The Position

NVIDIA has become the platform upon which every new AI-powered application is built. From healthcare research applications to autonomous vehicles and voice-recognition systems, the need for advanced perception and cognitive capabilities is exploding, and NVIDIA is at the center of this revolution. We are seeking a motivated Senior Systems Software Engineer to join our Autonomous Vehicle Infrastructure organization, focusing on building, deploying, and operating validation platforms at scale. In this role, you will work with internal teams and external partners to integrate distributed systems, manage large-scale data pipelines, and operationalize next-generation validation workflows for autonomous driving. This role offers a chance to start from the ground up: standing up new vendor-provided platforms, validating integration paths, and ensuring infrastructure is reliable, secure, and production-ready. You will combine hands-on engineering, infrastructure deployment, and workflow automation to help scale our AV validation ecosystem.

Requirements

BS/MS in Computer Science, Computer Engineering, or relevant field (or equivalent experience).
5+ years of professional experience in infrastructure, distributed systems, or platform engineering.
Hands-on experience with Linux systems, Kubernetes/Docker, Terraform, and CI/CD pipelines.
Strong scripting/development skills in Python and Bash, and exposure to C++ and/or Go.
Familiarity with Bazel build/test automation frameworks.
Experience in data/log ingestion workflows and distributed compute/storage systems.
Strong debugging, problem-solving, and communication skills to work across internal and vendor teams.
Proven comfort leveraging AI-based development tools, such as Claude Code and Cursor.

Nice To Haves

Strong experience in large-scale distributed systems or GPU/CPU cluster deployments, infrastructure automation, data pipelines, and AWS.
Prior experience with scenario-based validation platforms or AV simulation ecosystems.
Strong knowledge of logging/monitoring/alerting frameworks (Prometheus, Grafana, ELK stack).
Experience working directly with external vendors to integrate platforms and operationalize SLAs.
Proactive use of AI/ML techniques to accelerate log analysis, coverage metrics, or integration workflows.

Responsibilities

Deploy and operationalize vendor-provided platforms on our cloud-based service platform, starting with test environments to validate dependencies, workflows, and performance.
Build and maintain distributed infrastructure that supports large-scale log ingestion, data processing, and scenario validation.
Automate workflows and pipelines using Go, Python, Bash, and Bazel to ensure reproducibility, efficiency, and reliable distributed execution.
Integrate simulation and drive logs (e.g., world model data, road geometries) in various formats (e.g., Protobuf, Parquet) with validation platforms, ensuring seamless end-to-end coverage analysis.
Provide visualization and reporting capabilities to surface validation results, coverage metrics, and actionable insights for developers and stakeholders.
Define and manage access controls, monitoring, and security policies to ensure compliance while enabling smooth collaboration across internal and vendor teams.
Partner closely with internal teams and external vendors to troubleshoot issues, refine SLAs, and continuously improve operational reliability and scalability.