About The Position

NVIDIA provides the platform for all new AI-powered applications. We are hiring a strategic, technically proficient Senior Manager of Software Engineering to head the NVIDIA Inference Microservices (NIM) Factory group. You will direct and grow an extraordinary team of managers and senior technical leaders. The mission is ambitious: make NIM the global benchmark for distributing and running AI inference.

NIM embodies NVIDIA's vision for delivering AI inference as production-ready, optimized, enterprise-supported microservices that help customers move from experimentation to real-world impact across cloud, data center, and edge environments. Included in NVIDIA AI Enterprise, NIM bundles models, optimized runtimes, and validated configurations into standardized containers, reducing deployment friction and setting a new operational standard for inference on NVIDIA GPUs.

In this role, you will help shape how AI is consumed globally. Your organization launches day-0 models and backs them with enterprise-grade software crafted to give developers instant access to high-performance, pioneering models. You will partner closely with executive leadership across product, research, SRE, and security to set the strategy, oversee progress across multiple complex efforts, and safeguard the platform's long-term technical integrity.

Requirements

  • 15+ years of experience building and delivering production software systems. This includes 8+ years in engineering management and 3+ years managing managers at the Director level or equivalent.
  • Proven track record of leading large engineering organizations (50+ engineers) and driving complex, cross-functional programs from inception to successful production launch and scale.
  • Deep technical understanding of cloud‑native engineering (containers, Kubernetes, microservices) and modern SDLC practices; ability to dive deep into architecture and code when necessary.
  • Strong critical thinking and business insight; ability to translate high-level business goals into actionable engineering strategies.
  • Excellent communication and stakeholder management skills; ability to influence executive leadership across product, research, security, and operations.
  • A degree in Computer Science, Computer Engineering, or a related field (BS or MS) or equivalent experience.

Nice To Haves

  • Open Source Leadership: Significant contributions to or leadership of major open source projects in the AI/ML or cloud-native landscape (e.g., CNCF projects, Hugging Face ecosystem).
  • Led organizations that built and operated large‑scale LLM inference or model‑serving platforms (Triton, TensorRT‑LLM, vLLM, KServe) in production.
  • Experience architecting next-generation container build systems or CI/CD platforms at extensive scale.
  • Built and managed globally distributed organizations; established durable engineering processes that significantly improved quality and velocity across multiple teams.
  • Recognized industry leader with contributions to open‑source ecosystems, technical publications, or talks in containers, Kubernetes, GPU, or inference communities.

Responsibilities

  • Set the Strategy: Define the multi‑year technical strategy and roadmap to establish NIM as the universal runtime and distribution standard for AI inference. Ensure we can build, ship, and operate NIMs at exponential scale while setting the industry bar for ease of use and performance.
  • Manage the Portfolio: Develop the operating model for NIM Factory—OKRs, roadmap governance, technical standards, and cross-org dependency coordination—to achieve consistent results across various complex workstreams.
  • Lead Leaders: Guide the NIM Factory engineering organization (managing managers, senior managers, and senior technical leaders) across containers, orchestration, workflow, observability, and platform APIs. Develop organizational structure, succession plans, and leadership skills as the organization expands.
  • Accelerate Day‑0 Model Delivery: Own the factory systems and platform capabilities that package and optimize the latest models as soon as they are released, bridging powerful research and enterprise production demands.
  • Establish the Operational Standard: Define and guide the full platform standards that transform models, optimized runtimes, and validated configurations into standardized, enterprise-supported containers that customers can deploy across cloud, data center, and edge environments.
  • Own Reliability, Security, and Compliance Outcomes: Partner with SRE and security leadership to set SLOs, establish durable incident/postmortem and release practices, and ensure NIMs meet enterprise expectations for availability, performance, and supply‑chain rigor.
  • Champion Open Source and Standards: Lead our upstream strategy and partnerships to build standards in containerization, orchestration, and inference. Ensure NVIDIA contributes meaningfully and is an outstanding citizen in the ecosystem.
  • Own cost efficiency, capacity strategy, and operational health for the global factory. Ensure we invest in the right capabilities and remove bottlenecks to growth.

Benefits

  • You will be eligible for equity and benefits.