Senior ML Software Engineer - Quantization & Numerics

Microsoft Corporation, Mountain View, CA

About The Position

Do you want to be at the forefront of innovating the latest hardware designs to propel Microsoft's cloud growth? Are you seeking a unique career opportunity that combines technical depth and cross-team collaboration with business insight and strategy? Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees, we come together with a growth mindset, innovate to empower others, and collaborate to achieve our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond. In alignment with these values, we are committed to cultivating an inclusive work environment in which every employee can positively impact our culture every day.

Join the Strategic Planning and Architecture (SPARC) team within Microsoft's Azure Hardware Systems and Infrastructure (AHSI) organization, the team behind Microsoft's expanding cloud infrastructure and its "Intelligent Cloud" mission. Microsoft delivers more than 200 online services to more than one billion people worldwide, and AHSI provides the core infrastructure and foundational technologies for Microsoft's cloud businesses, including Microsoft Azure, Bing, MSN, Office 365, OneDrive, Skype, Teams, and Xbox Live.

Requirements

  • Bachelor's Degree in Computer Science, Electrical or Computer Engineering, or a related field AND 4+ years of industry experience in high-performance ML systems, GPU kernel development, or ML runtime/infrastructure development; OR Master's Degree in one of the same fields AND 3+ years of such experience; OR Doctorate in one of the same fields AND 1+ year(s) of such experience.
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Nice To Haves

  • Demonstrated experience delivering production-grade software in areas such as model compression, low-precision numerics (FP8, INT8/4, NVFP4, MX formats, etc.), low-level kernel development, and performance optimization (a minimal quantization sketch follows this list for context).
  • Proficiency with modern deep learning frameworks, including PyTorch, TensorFlow, TensorRT, and ONNX Runtime.
  • Expertise in GPU/NPU kernel development using CUDA, Triton, ROCm, or comparable frameworks, and the ability to bring up models quickly on a new stack.
  • Strong understanding of Transformer and LLM architectures, with hands-on experience in optimization techniques such as quantization, pruning, tensor/parameter sharding, model parallelism, KV-cache optimization, and Flash Attention.
  • Practical experience with large-scale model evaluation, including benchmarking state-of-the-art LLMs and fine-tuning large models (SFT or RL).
  • Solid programming skills in Python, C, and C++.
  • Excellent communication abilities and a proven capacity to collaborate effectively in hybrid team-oriented environments.
  • Hands-on experience implementing and optimizing low-level linear algebra routines, including custom BLAS kernels.
  • Deep knowledge of mixed-precision arithmetic units, including numerical formats and microarchitecture.
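
For context on the quantization experience described above, here is a minimal, illustrative sketch of symmetric per-tensor INT8 quantization and dequantization in PyTorch. It is a conceptual baseline only: the formats named in the role (FP8, NVFP4, MX) involve per-block scaling, saturation rules, and hardware-specific packing well beyond this example, and nothing here reflects the team's actual implementation.

    import torch

    def quantize_int8(x: torch.Tensor):
        """Symmetric per-tensor INT8 quantization: one scale for the whole tensor."""
        scale = x.abs().max().clamp(min=1e-8) / 127.0
        q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
        return q, scale

    def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        """Map INT8 codes back to an approximate float tensor."""
        return q.to(torch.float32) * scale

    x = torch.randn(4, 8)
    q, scale = quantize_int8(x)
    x_hat = dequantize_int8(q, scale)
    print("max abs quantization error:", (x - x_hat).abs().max().item())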

Responsibilities

  • Design and develop novel quantization and numerics kernels to enable efficient deployment of LLM inference and training in Microsoft's Azure production environments (a brief kernel sketch follows this list for illustration).
  • Drive proof-of-concept efforts in software development and model-optimization tooling to streamline deployment of quantized models.
  • Analyze performance bottlenecks in quantized state-of-the-art LLM architectures and drive performance improvements.
  • Prototype and evaluate emerging low-precision data formats through proof-of-concept implementations on novel hardware accelerator SDKs.
  • Co-design model architectures optimized for low-precision deployment in close collaboration with companywide AI/ML teams.
  • Work cross-functionally with data scientists and ML researchers/engineers across organizations to align on model accuracy and performance goals.
  • Partner with hardware architecture and AI software framework teams to ensure end-to-end system efficiency.
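
As a rough illustration of the kernel work these responsibilities involve (and not the team's actual code), the sketch below shows a per-tensor INT8 dequantization kernel written in Triton with a small PyTorch launcher. It assumes a CUDA-capable GPU and a current Triton installation; production kernels would typically fuse dequantization into GEMMs and use block-wise scales.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def dequant_int8_kernel(q_ptr, out_ptr, scale, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program instance dequantizes one contiguous block of elements.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        q = tl.load(q_ptr + offsets, mask=mask, other=0)   # INT8 codes
        tl.store(out_ptr + offsets, q.to(tl.float32) * scale, mask=mask)

    def dequantize_int8(q: torch.Tensor, scale: float) -> torch.Tensor:
        out = torch.empty(q.shape, device=q.device, dtype=torch.float32)
        n = q.numel()
        grid = (triton.cdiv(n, 1024),)
        dequant_int8_kernel[grid](q, out, scale, n, BLOCK_SIZE=1024)
        return out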

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Industry: Publishing Industries
  • Number of Employees: 5,001-10,000 employees
