Principal Engineer, NPU Architect

Renesas Electronics, Austin, TX
Hybrid

About The Position

We are looking for a Principal NPU Hardware Architect with 10 to 15 years of experience to drive the architectural definition and hardware implementation of high-performance Neural Processing Units (NPUs) targeted at microcontrollers and microprocessors for automotive high-performance compute. This is a hardware-oriented role that requires a deep understanding of the full silicon lifecycle, combined with a strong background in hardware-software co-design, to ensure the NPU architecture is highly optimized for compiler-driven execution and software stacks.

Requirements

  • 12+ years in AI accelerator, NPU, or GPU hardware architecture and RTL design.
  • Deep knowledge of deep learning primitives (CNNs, Transformers, RNNs) and how they map to spatial compute hardware.
  • Strong understanding of compiler backends (e.g., LLVM, MLIR), IR transformations, and how hardware features like scratchpad memories or tiling impact compiler efficiency (see the tiling sketch after this list).
  • Proven track record with modern SoC protocols (AXI/ACE/CHI) and integrating NPU cores into larger system-on-chip environments.
  • Expert-level proficiency in SystemC/TLM or C++ for architectural performance modeling and hardware-software co-verification.
  • Ability to act as a technical authority, mentoring junior designers and influencing cross-functional roadmaps.
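
The compiler-backend requirement above turns on a concrete idea: blocking a computation so each working set fits in a small local memory. Below is a minimal C++ sketch of scratchpad-oriented loop tiling for a matmul; the matrix size, tile size, and the DMA comments are illustrative assumptions, not details of any Renesas NPU.

```cpp
// Minimal sketch: how loop tiling maps a matmul onto a small scratchpad.
// All sizes and names here are illustrative assumptions.
#include <cstdio>
#include <vector>

constexpr int N = 64;     // square matrix dimension (assumed)
constexpr int TILE = 16;  // tile edge chosen so three tiles fit in scratchpad

// C = A * B, blocked so each (TILE x TILE) working set could live in a
// scratchpad instead of streaming from DRAM on every access.
void tiled_matmul(const std::vector<float>& A,
                  const std::vector<float>& B,
                  std::vector<float>& C) {
    for (int i0 = 0; i0 < N; i0 += TILE)
        for (int j0 = 0; j0 < N; j0 += TILE)
            for (int k0 = 0; k0 < N; k0 += TILE)
                // The inner loops touch only three TILE x TILE blocks; a
                // compiler backend would emit DMA loads of those blocks into
                // scratchpad here, then run the MACs out of local memory.
                for (int i = i0; i < i0 + TILE; ++i)
                    for (int j = j0; j < j0 + TILE; ++j) {
                        float acc = C[i * N + j];
                        for (int k = k0; k < k0 + TILE; ++k)
                            acc += A[i * N + k] * B[k * N + j];
                        C[i * N + j] = acc;
                    }
}

int main() {
    std::vector<float> A(N * N, 1.0f), B(N * N, 1.0f), C(N * N, 0.0f);
    tiled_matmul(A, B, C);
    std::printf("C[0] = %.1f (expect %d)\n", C[0], N);  // 64.0
    return 0;
}
```

The tile size is the key knob: it must be small enough that the A, B, and C blocks fit in scratchpad simultaneously, yet large enough to amortize the DMA transfers that would replace the direct array accesses here.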

Nice To Haves

  • Bachelor’s or Master’s in Electrical Engineering or Computer Engineering (PhD desirable)

Responsibilities

  • NPU Architecture & Dataflow: Define and own the end-to-end NPU micro-architecture, including high-throughput tensor/matrix engines, vector units, and specialized activation functional units.
  • Hardware-Software Co-Design: Partner closely with compiler and software teams to define instruction sets (ISA), memory management schemes, and hardware-aware graph optimizations.
  • Virtualization & Multi-Tenancy: Architect hardware-assisted virtualization features to enable secure resource sharing and multi-tenant execution in cloud or edge environments.
  • Interconnect & Fabric: Design and integrate high-bandwidth bus fabrics (e.g., NoC, CHI) and DMA controllers optimized for the massive data movement inherent in AI workloads.
  • Infrastructure & Power: Lead the definition of SoC infrastructure elements, including complex clock/reset domains and advanced power management strategies to maximize performance-per-watt.
  • Performance Modeling: Develop bit-accurate and cycle-accurate C++/SystemC models to validate architectural choices and enable early software development (see the modeling sketch after this list).
  • Full Design Flow: Oversee the transition from architectural spec to RTL, providing technical leadership through verification, physical design, and post-silicon bring-up.
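
To give a flavor of the performance-modeling responsibility above, here is a minimal cycle-approximate analytical model in plain C++ (a production model would be a bit- and cycle-accurate SystemC/TLM one, as the posting says). Every number in it, the MAC-array width, the DMA bandwidth, and the example layer, is an illustrative assumption.

```cpp
// Minimal sketch of a cycle-approximate analytical model for a MAC array.
// Parameters (array size, bandwidth, example layer) are assumptions only.
#include <algorithm>
#include <cstdint>
#include <cstdio>

struct NpuConfig {
    int macs_per_cycle = 256;       // e.g., a 16x16 MAC array (assumed)
    double bytes_per_cycle = 32.0;  // sustained DMA bandwidth (assumed)
};

struct Layer {
    int64_t macs;         // total multiply-accumulates in the layer
    int64_t bytes_moved;  // weights + activations streamed in/out
};

// Roofline-style estimate: the layer is bound by whichever of compute or
// data movement takes longer, assuming compute and DMA fully overlap.
int64_t estimate_cycles(const NpuConfig& cfg, const Layer& layer) {
    int64_t compute =
        (layer.macs + cfg.macs_per_cycle - 1) / cfg.macs_per_cycle;
    int64_t dma = static_cast<int64_t>(layer.bytes_moved / cfg.bytes_per_cycle);
    return std::max(compute, dma);
}

int main() {
    NpuConfig cfg;
    // 3x3 conv, 64 -> 64 channels, 56x56 output (a ResNet-like layer).
    Layer conv{/*macs=*/int64_t(56) * 56 * 64 * 64 * 3 * 3,
               /*bytes_moved=*/int64_t(4) * 1024 * 1024};
    std::printf("estimated cycles: %lld\n",
                static_cast<long long>(estimate_cycles(cfg, conv)));
    return 0;
}
```

The roofline-style max() bakes in the assumption that compute and DMA overlap perfectly; a cycle-accurate model replaces exactly that assumption with per-cycle simulation of the array, the DMA engines, and the fabric between them.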