Software Engineer, Performance Tooling and Infrastructure

Nuro
Mountain View, CA
$152,000 - $228,000

About The Position

Nuro leverages many different bench-top systems to evaluate and regression-test aspects of the software and hardware integration layer. Every autonomy code change, from ML model updates to the radius of map kept around the robot to the number of evaluated trajectories, must be validated for real-time performance on actual robot compute hardware before it reaches the road. You will own the infrastructure that makes this possible.

Our Performance Simulation Platform is a hybrid benchmarking system: physical bench-top rigs running production robot compute (NVIDIA Thor platform), orchestrated by cloud-native infrastructure (Kubernetes, GCP); automated data pipelines feeding performance metrics into BigQuery and Grafana; pre- and post-simulation processing; custom tracing and profiling tools; and much more.

Engineers across the company rely on this platform daily to answer questions like: How will my new ML model affect contention on the GPU? How does a new data format impact onboard logging rate or network contention as more data flows through the system? How much memory should be allocated for a new module, and how does it fit into the overall system budget?

You'll be responsible for the development, integration, and evolution of this platform — from the bare-metal OS and networking layer, through job orchestration and CI/CD integration, up to the data analysis and visualization layer. This is a high-ownership, high-autonomy role on a small team where your work directly gates the release velocity of the entire autonomy stack. You'll be the technical DRI for the platform — setting the roadmap, making architectural calls, representing the platform's needs to the leadership team, and ensuring the system scales through multiple hardware generations.

Requirements

  • Experience: 3+ years of industry software engineering experience.
  • Software Engineering: Strong proficiency in Python and working proficiency in C++. You write clean, testable, well-documented code and care about long-term maintainability.
  • Data Engineering: Experience building data pipelines: ingestion, transformation, storage, and visualization. Familiarity with SQL and analytical workflows.
  • Systems & Infrastructure: Deep comfort with Linux systems — you've configured kernels, debugged boot issues, written systemd units, or managed bare-metal infrastructure. You understand networking, storage, and compute at a level beyond "it just works."
  • Technical Leadership: Experience setting technical vision and roadmap for a project or platform, driving alignment across multiple stakeholders. You've independently identified the cross-functional partners needed to unblock and deliver, and you've briefed senior engineering leadership on trade-offs and recommendations.
  • AI-Native: You treat AI as a core part of your engineering workflow, not an occasional shortcut — you use agentic tooling (e.g., Claude Code) across the development lifecycle and you understand the boundaries of when AI output demands extra scrutiny versus when it accelerates you.
  • Bias for Action: Comfortable operating in ambiguous, fast-moving environments where you need to balance long-term architecture with short-term delivery.

Nice To Haves

  • Experience with performance engineering, especially around tooling integration (perf, Perfetto, pprof, eBPF, NVIDIA Nsight Systems, NVIDIA CUPTI).
  • Experience in robotics or AV, particularly with NVIDIA DriveOS stack.

Responsibilities

  • Benchmarking Infrastructure: Develop and maintain the job orchestration layer that schedules, executes, and validates autonomy performance benchmarks across a fleet of physical bench-top systems — integrated into CI/CD pipelines as merge-blocking and release-blocking quality gates.
  • Platform Reliability & Observability: Build monitoring, alerting, and self-healing automation for the bench fleet. Proactively identify systemic risks — capacity bottlenecks, hardware degradation patterns, infrastructure single points of failure — before they become outages. Track utilization, failure rates, and capacity trends to ensure the platform scales ahead of organizational demand.
  • Performance Data Pipelines: Design and build end-to-end data pipelines that capture fine-grained performance metrics (CPU/GPU utilization, memory bandwidth, E2E latency, scheduling jitter) from bench-top runs, process them at scale, and surface actionable insights through dashboards and automated regression detection.
  • Statistical Analysis & Experimentation: Work with Data Science to develop rigorous experimentation methodology for performance results from non-deterministic autonomy workloads — including variance analysis, significance testing, and regression detection.
  • Bare-Metal & OS Platform: Guide the SRE team through the OS and system-level configuration of bench hardware — including Linux kernel tuning, boot infrastructure, networking, and hardware bring-up — ensuring the platform faithfully reproduces production robot compute behavior.
  • Drive Platform & Allocation Strategy: Own the planning lifecycle for the benchmarking fleet across hardware generations. Partner with engineering and program leadership to negotiate hardware allocation, model utilization scenarios under real-world constraints, and present data-backed trade-off recommendations — balancing testing coverage, user throughput, cost, and SLA commitments against finite physical resources.
  • Cross-Functional Collaboration: Partner with Hardware Engineering, NPI (New Product Introduction), SRE (Site Reliability Engineering), Perception, Behavior, and Data Science teams to translate their performance analysis needs into robust, self-service infrastructure.