Software Engineer II

Microsoft • Redmond, WA

About The Position

The MAIA System Infrastructure team is building the foundational software layers that power Microsoft’s custom AI accelerator, the MAIA chip, across both training and inference workloads. Our mission is to define the runtime and dataflow infrastructure that allows models and systems to scale seamlessly across racks of MAIA devices. We are designing next-generation systems that redefine how data is moved, synchronized, and orchestrated across accelerators over PCIe and other high-speed interconnects. As part of this team, you will help build and evolve the core runtime infrastructure that sits at the boundary of hardware and software and powers massive model execution at cloud scale. Our work spans device-driver communication, runtime coordination, and low-level scheduling, all optimized for latency, bandwidth, and throughput. We work hand in hand with hardware teams, compiler and model teams, and observability partners to ensure every byte moved is intentional and efficient. This role is a unique opportunity to shape the AI infrastructure layer from the ground up and to join a collaborative, systems-focused group that thrives at the intersection of hardware design, low-level systems software, and scalable cloud AI deployment.

Requirements

Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, or Python OR equivalent experience.
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Nice To Haves

Master's Degree in Computer Science or related technical field AND 3+ years technical engineering experience with coding in languages including, but not limited to, C, C++, or Python OR Bachelor's Degree in Computer Science or related technical field AND 5+ years technical engineering experience with coding in languages including, but not limited to, C, C++, or Python OR equivalent experience.
3+ years of experience in systems programming (C, C++, Rust, or similar), with a focus on low-level or performance-critical software.
Solid understanding of memory models, concurrency, and interprocess communication.
Ability to reason about performance tradeoffs, including latency vs. bandwidth, queuing models, and batching vs. streaming.
Proven problem-solving skills with the ability to tackle complex technical challenges related to data flow efficiency and infrastructure optimization.
A track record of working on high-impact projects, demonstrating a passion for building robust, high-performance systems.
Excellent collaboration and communication skills, with a drive to work alongside top-tier engineers to push the boundaries of AI acceleration tooling.
Experience working on infrastructure involving hardware interfaces or device communication (e.g., PCIe, DMA, RDMA, or similar).
Familiarity with GPU, TPU, or other accelerator architectures and their runtime systems.
Experience implementing communication protocols or working with driver/kernel interfaces.
Exposure to observability or profiling tools (e.g., eBPF, trace buffers, performance counters, telemetry hooks).
Strong cross-discipline collaboration skills, such as hardware/software codesign or coordination with test and validation teams.

Responsibilities

Actively contribute to a culture of inclusivity by valuing diverse perspectives, mentoring peers, and promoting open communication. Support and uplift teammates to ensure everyone can contribute their best in a high-performing, collaborative environment.
Design and implement core components of the MAIA runtime, including:
PCIe-based communication protocols
Data movement orchestration between host and device memory
Command encoding and dispatch mechanisms for AI workloads
Synchronization and stream control primitives across devices and execution units
Collaborate with hardware, firmware, and compiler teams to define and refine the runtime contract between layers.
Optimize performance-critical code paths to minimize latency and maximize throughput, particularly around memory copies, kernel launch sequencing, and queue management.
Contribute to tooling and test infrastructure that enables validation, tracing, and performance benchmarking of your components.
Participate in code reviews, design reviews, and cross-team architecture discussions.
Drive high-quality implementation practices: testing, documentation, and debugging support.