About The Position

Microsoft Azure is at the core of the Microsoft Cloud, providing the backend infrastructure for large-scale distributed and dynamic computing. Our team delivers the software platform that enables internal Microsoft services such as Office 365, Bing.com, Xbox Live, Skype, and OneDrive, as well as external customers, to run mission-critical cloud applications for their businesses.

We are seeking a Senior Software Engineer to help evolve and expand our software platform and infrastructure. Areas of focus include core infrastructure services at ring 0 and ring (-1), achieving five nines (99.999%) reliability, fault tolerance, distributed service monitoring, operational efficiency across the datacenter hardware lifecycle, performance metrics collection and analysis, alerting, visualization, device operations, and coordination of node diagnostics and repairs.

This role offers the opportunity to work on highly strategic projects at massive scale, building robust distributed systems that form the backbone of the Microsoft Cloud. If you are passionate about designing and implementing solutions that drive reliability and efficiency, we would like to hear from you.

Requirements

  • Proven experience in software engineering, particularly in distributed systems.
  • Strong understanding of cloud infrastructure and services.
  • Experience with reliability engineering and fault tolerance.
  • Knowledge of performance metrics collection and analysis.
  • Ability to design and implement robust solutions for large-scale systems.

Nice To Haves

  • Experience with Microsoft Azure or similar cloud platforms.
  • Familiarity with operational efficiency practices in datacenters.
  • Background in monitoring and alerting systems.

Responsibilities

  • Evolve and expand the software platform and infrastructure.
  • Focus on core infrastructure services at ring 0 and ring (-1).
  • Achieve five nines (99.999%) reliability and build fault-tolerant services.
  • Implement distributed service monitoring and improve operational efficiency across the datacenter hardware lifecycle.
  • Collect and analyze performance metrics.
  • Develop alerting and visualization tools.
  • Manage device operations and coordinate node diagnostics and repairs.