About The Position

Want to impact the foundation for future AI storage development in Azure, the world's computer? The Azure Managed Lustre File System (AMLFS) team leads development, deployment, and monitoring of the most popular High-Performance Computing (HPC) parallel file system in the world: Lustre, the Azure storage solution of choice for AI training and fine-tuning. The AMLFS Platform Team is responsible for end-to-end delivery of AMLFS images, cluster deployment, logs and metrics, and configuration compliance. An ideal candidate will also have opportunities to impact cluster architecture and design of Lustre in the Azure ecosystem, performance analysis and optimization of AMLFS, and customer support for the most challenging parallel filesystem bugs or performance anomalies that arise within our product.

As a Principal Software Engineer in the AMLFS Platform Team, you will lead design and development of key features, primarily working on reliable deployment of AMLFS in Azure, assessing and mitigating security risks, developing comprehensive unit and system-level tests, and diagnosing, mitigating, and fixing the most challenging deployment and upgrade customer issues. You will lead the design and development of logging, monitoring, and reporting capabilities for AMLFS and help define and measure key Service Level Indicators designed to make our product increasingly robust.

This opportunity will allow you to develop expertise in distributed system and HPC/AI filesystem design, implementation, and debugging, grow proficient in navigating and managing Linux operating systems, and hone leadership qualities as you develop strong collaborative working relationships with the core storage, compute, and networking teams that form the foundation of Azure.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Requirements

  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, or Python OR equivalent experience.
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.
  • These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Nice To Haves

  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, or Python OR equivalent experience.

Responsibilities

  • Partners with appropriate stakeholders to determine user requirements for a set of scenarios.
  • Leads identification of dependencies and the development of design documents for a product, application, service, or platform.
  • Leads by example and mentors others to produce extensible and maintainable code used across products.
  • Leverages subject-matter expertise of cross-product features with appropriate stakeholders (e.g., project managers) to drive multiple groups' project plans, release plans, and work items.
  • Holds accountability as a Designated Responsible Individual (DRI), mentoring engineers across products/solutions, working on-call to monitor system/product/service for degradation, downtime, or interruptions.
  • Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of products, while also driving consistency in monitoring and operations at scale and sharing knowledge with other engineers.