About The Position

The Microsoft Azure Artificial Intelligence/High Performance Computing (AI/HPC) team is seeking a passionate Software Engineer 2 to help build, operate, and support hyperscale cloud infrastructure for some of the world’s largest supercomputing deployments. You’ll work alongside experienced engineers to develop, monitor, and troubleshoot cloud-native supercomputing systems, contributing to the reliability and performance of Azure’s AI infrastructure offerings.

At supercomputing scale, we need specialized tools and techniques to maintain the availability, reliability, runtime performance, and health of the system so that it meets customers’ Service Level Agreements (SLAs). Your job will be to build and use state-of-the-art cloud applications and services to monitor the health of the supercomputers, find operational gaps, and instrument features for the smooth management of cloud-native supercomputers. As a Supercomputing Software Engineer, you will also bring best practices that drive architectural changes and influence the roadmap of relevant software and hardware components. Your work will directly impact the business goals of a wide range of users and facilitate the next wave of growth and innovation in AI in the cloud.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Requirements

  • Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position requires passing the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Nice To Haves

  • Bachelor’s Degree in Computer Science or related technical field AND 4+ years technical engineering experience OR Master’s degree in Computer Science or related technical field AND 3+ years technical engineering experience.
  • Experience with monitoring, profiling, or debugging distributed systems or cloud applications.
  • Familiarity with AI/HPC workloads, GPU-based systems, AI-assisted software development, and secure software design practices.
  • Familiarity with the IaaS operating model and SLA commitments.

Responsibilities

  • Be proactive and innovative about adding new metrics for monitoring the health of the supercomputers.
  • Collaborate with team members and stakeholders to understand requirements and produce detailed, data-driven, collaborative design for assigned features.
  • Independently use appropriate artificial intelligence tools and practices across the software development lifecycle to develop, test, debug, and maintain code for supercomputer health monitoring systems.
  • Remain current in skills by investing time and effort into staying abreast of current developments that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale.
  • Act as a Designated Responsible Individual (DRI), working on-call to monitor the system, product feature, or service for degradation, downtime, or interruptions, and gain approval to restore the system, product, or service for simple problems.