The AI Platform organization builds the end-to-end Azure AI stack, from the infrastructure layer to the PaaS and user experience offerings for AI application builders, researchers, and major partner groups across Microsoft. The platform is core to Azure's innovation, differentiation, and operational efficiency, as well as to the AI-related capabilities of all of Microsoft's flagship products, from M365 and Teams to GitHub Copilot and Bing Copilot. We are the team building the Azure OpenAI service, AI Foundry, Azure ML Studio, Cognitive Services, and the global Azure infrastructure for managing the GPU and NPU capacity running the largest AI workloads on the planet.

One of the major, mature offerings of AI Platform is Azure ML Services. It gives data scientists and developers a rich experience for defining, training, fine-tuning, deploying, monitoring, and consuming machine learning models. We provide the infrastructure and workload management capabilities powering Azure ML Services, and we engage directly with some of the major internal research and applied ML groups using these services, including Microsoft Research and the Bing WebXT team.

As part of AI Platform, the AI Infra team is looking for a Software Engineer II - AI Infrastructure (Scheduler) - CoreAI, with an initial focus on the Scheduler subsystem. The scheduler is the "brains" of the AI Infra control plane. It governs access to the GPU and NPU capacity of the platform according to a complex system of workload preference rules, placement constraints, optimization objectives, and dynamically interacting policies aimed at maximizing hardware utilization while fulfilling the widely varying needs of users and AI Platform partner services in terms of workload types, prioritization, and capacity targeting flexibility.

The scheduler's set of capabilities is broad and ambitious. It manages quota, capacity reservations, SLA tiers, preemption, auto-scaling, and a wide range of configurable policies. Global scheduling is a distinctive major feature: it overcomes the regional segmentation of the Azure compute fleet by treating GPU capacity as a single global virtual pool, which greatly increases capacity availability and utilization for major classes of ML workloads. We achieved this without introducing a global single point of failure: regional instances of the scheduler service interact via peer-to-peer protocols to share capacity inventory and coordinate the handoff of jobs for scheduling. The system also manages a significant amount of GPU capacity outside Azure datacenters, through a unified model, a common operational process, and highly generalized, flexible workload scheduling capabilities.

To manage the inherent complexity of the Scheduler subsystem and meet stringent expectations for service reliability, availability, and throughput, we emphasize rigorous engineering, utmost precision and quality, and ownership, from feature design to livesite. A quality mindset, attention to detail, development process rigor, and data-driven design and problem-solving skills are key to success in our mission-critical control plane space.
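To make the global scheduling model concrete, here is a minimal, hypothetical sketch of the handoff idea described above: each regional scheduler tries to place a job against its own inventory first and, failing that, hands it off to the peer advertising the most free capacity. Every name in this sketch (`RegionalScheduler`, `Job`, `gpus_needed`, `sla_tier`) is an illustrative assumption for this posting, not the service's actual API.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Job:
    job_id: str
    gpus_needed: int
    sla_tier: int  # hypothetical: lower value = higher priority; a real
                   # system would key preemption decisions off this field
    preferred_regions: list[str] = field(default_factory=list)

@dataclass
class RegionalScheduler:
    region: str
    free_gpus: int
    peers: list["RegionalScheduler"] = field(default_factory=list)

    def try_place_locally(self, job: Job) -> bool:
        """Admit the job if local free capacity covers its request."""
        if self.free_gpus >= job.gpus_needed:
            self.free_gpus -= job.gpus_needed
            return True
        return False

    def schedule(self, job: Job) -> Optional[str]:
        """Place locally first; otherwise hand off to the eligible peer
        with the most advertised free capacity (the 'single global
        virtual pool' view built from shared inventory)."""
        if self.try_place_locally(job):
            return self.region
        candidates = [p for p in self.peers
                      if not job.preferred_regions
                      or p.region in job.preferred_regions]
        for peer in sorted(candidates, key=lambda p: p.free_gpus, reverse=True):
            if peer.try_place_locally(job):
                return peer.region
        return None  # a real scheduler would queue, auto-scale, or preempt

# Usage: a job too large for westus is handed off to eastus.
west = RegionalScheduler("westus", free_gpus=8)
east = RegionalScheduler("eastus", free_gpus=64)
west.peers = [east]
print(west.schedule(Job("train-01", gpus_needed=32, sla_tier=1)))  # -> eastus
```

In the actual system, the peer inventory would be eventually consistent state exchanged over the peer-to-peer protocol rather than in-process references, which is what lets the design avoid a global single point of failure.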
Job Type
Full-time
Career Level
Mid Level
Industry
Publishing Industries
Number of Employees
5,001-10,000 employees