About The Position

The Azure High Performance Computing and Artificial Intelligence Platform group is the team behind Azure’s cloud offering that powers some of the largest-scale and most demanding Artificial Intelligence training and inference workloads in the industry. The virtual machine series that our team owns combines cutting-edge graphics processing units (GPUs) and accelerators with a state-of-the-art scale-out network infrastructure to enable these workloads. We collaborate with many Microsoft teams and our industry partners to design and bring up the underlying platform, and we build the software to expose this platform as an Azure service.
As a Senior Software Engineer in the Azure High Performance Computing and Artificial Intelligence team, you will play a critical role in both enhancing our current graphics processing unit virtual machine offerings and designing and delivering the next generations of our platform. You will do this by solving technical problems at all levels of the stack, contributing to our codebases to enable new features on our virtual machines, and collaborating with our industry partners. This position involves deep technical work covering a broad range of areas, including hardware/software interactions, device virtualization, and performance analysis of graphics processing unit workloads in virtual machines. Since our team is also responsible for the vertical integration of our virtual machine offerings, you will also have the opportunity to work with upper layers of the Azure infrastructure software and to engage directly with our high-volume internal customer teams to resolve issues they face on our current offering.
It is an exciting time for the team as we are working on expanding the capacity and range of supported scenarios to fuel the next growth wave. This position offers a unique opportunity to have a huge impact on Microsoft’s Artificial Intelligence infrastructure and Artificial Intelligence initiatives.
Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees, we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Requirements

  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Nice To Haves

  • Master's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Knowledge and understanding of backend networks

Responsibilities

  • Analyzes functionality, integration, and performance issues at various levels of the HW/SW stack and on current and future generations of AI training platforms.
  • Designs and codes solutions that improve functional correctness, stability and performance of AI training-oriented VM offerings and related services.
  • Optimizes, debugs, refactors, and reuses code to improve performance, maintainability, effectiveness, and return on investment (ROI). Applies metrics to drive the quality and stability of code, as well as appropriate coding patterns and best practices.
  • Holds accountability as a Designated Responsible Individual (DRI), working on-call to monitor the system/product/service for degradation, downtime, or interruptions.
  • Your mission will be to help ensure the Azure platform performs consistently, can scale on demand, and is engineered to withstand the unparalleled computing demand from customer workloads. You will help build a test-driven engineering culture to reduce regressions and bugs in production, and you will set a higher bar for infrastructure quality.
  • Embody our Culture and Values