Software Engineer

Microsoft

About The Position

The Azure High Performance Computing and Artificial Intelligence Platform team is responsible for Microsoft Azure’s cloud infrastructure that supports some of the most demanding and large-scale artificial intelligence training and inference workloads in the industry. The virtual machine series managed by this team integrates advanced graphics processing units (GPUs) and accelerators with a scale-out network infrastructure to enable these workloads. The team collaborates with internal Microsoft groups and industry partners to design the underlying platform and build the software that exposes it as an Azure service.

As a Software Engineer in the Azure High Performance Computing and Artificial Intelligence team, you will play a key role in enhancing current GPU-based virtual machine offerings and designing future generations of the platform. You will contribute across the stack: solving technical challenges, enabling new features, and working closely with industry partners to deliver scalable and reliable solutions. This role involves deep technical engagement, including hardware and software interactions, device virtualization, and performance analysis of GPU workloads in virtual machines. Because the team is responsible for vertical integration of virtual machine offerings, you will also work with upper layers of Azure infrastructure and engage directly with high-volume internal customer teams to resolve issues and improve service quality. With the team actively expanding capacity and supported scenarios, this is a unique opportunity to make a significant impact on Microsoft’s artificial intelligence infrastructure and initiatives.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees, we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Requirements

  • Bachelor's Degree in Computer Science or a related technical discipline with proven experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience.
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: the Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Nice To Haves

  • Bachelor's Degree in Computer Science OR related technical field AND 1+ year(s) technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, OR Python OR Master's Degree in Computer Science or related technical field with proven experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • 1+ year(s) of experience with Machine Learning, AI Infrastructure, Operating Systems fundamentals, and virtualization technologies.
  • 1+ year(s) experience in analyzing and troubleshooting large-scale distributed systems.
  • 1+ year(s) experience with High Performance Computing / Machine Learning middleware.

Responsibilities

  • Analyzes functionality, integration, and performance issues at various levels of the HW/SW stack and on current and future generations of AI training platforms.
  • Designs and codes solutions that improve the functional correctness, stability, and performance of AI-training-oriented VM offerings and related services.
  • Optimizes, debugs, refactors, and reuses code to improve performance, maintainability, effectiveness, and return on investment (ROI). Applies metrics to drive the quality and stability of code, as well as appropriate coding patterns and best practices.
  • Holds accountability as a Designated Responsible Individual (DRI), serving on call to monitor the system, product, or service for degradation, downtime, or interruptions.
  • Helps ensure the Azure platform delivers consistent performance, scales on demand, and is engineered to withstand the unparalleled computing demand of customer workloads.
  • Helps build a test-driven engineering culture that reduces regressions and bugs in production and sets a higher bar for infrastructure quality.