About The Position

Azure Specialized works to bring the next generation of workloads to our Public Cloud platform, enabling new end-to-end scenarios for Azure customers. The team imagines and builds differentiating customer features and fundamental building blocks at the heart of the Azure platform, working collaboratively with many industry partners. This is a highly impactful team with robust growth opportunities, focusing on AI infrastructure, Cloud services, and Security.

As an SRE II in Azure Specialized, you will gain valuable experience in service architecture, datacenter networking, monitoring, and security, as well as working with partner teams. A primary focus is designing, developing, deploying, managing, and monitoring various product features and infrastructure, which will allow you to develop backend infrastructure supporting diverse services. The work for this position will cross many layers of Azure Services, presenting unique engineering challenges and offering great opportunities to work with many partner teams and gain broad exposure to control plane and data plane technologies end-to-end.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more, fostering a culture of inclusion built on values of respect, integrity, and accountability.

Requirements

  • Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience.
  • 1+ year(s) of experience managing physical infrastructure.
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role.
  • This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Nice To Haves

  • Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience.
  • 2+ years technical experience working with large-scale cloud or distributed systems.
  • 1+ year(s) people management experience.
  • Experience working on large-scale distributed services with on-call responsibilities.
  • Ability to build and influence broadly towards common goals and priorities.
  • Ownership of end-to-end project lifecycle with solid project management and communication skills.
  • Experience with managing physical infrastructure, supporting GPUs and InfiniBand.

Responsibilities

  • Contributes to efforts to collect, classify, and analyze data with little oversight on a range of metrics (e.g., health of the system, where bugs might be occurring).
  • Contributes to the refinement of product features by escalating findings from analyses to inform decisions regarding the engineering of products.
  • Contributes to the development of automation within production and deployment of a complex product feature.
  • Runs code in simulated or other non-production environments to confirm functionality and error-free runtime for products with little to no oversight.
  • Contributes to efforts to ensure the correct processes are followed to achieve a high degree of security, privacy, safety, and accessibility.
  • Checks for visible evidence to demonstrate compliance for product areas.
  • Develops and holds an understanding of the implications of onboarding new technologies following expectations of compliance at Microsoft.
  • Remains current in skills by investing time and effort into staying abreast of current developments that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale.
  • Applies best practices to reliably build code that is based on well-established methods.
  • Follows best practices for product development and for scaling to meet customer requirements and performance expectations.
  • Maintains communication with key partners across the Microsoft ecosystem of engineers.
  • Considers partners across teams and their end goals for products in order to achieve desirable user experiences and fit the dynamic needs of partners/customers through product development.
  • Maintains operations of live service as issues arise on a rotational, on-call basis.
  • Implements solutions and mitigations to more complex issues impacting performance or functionality of Live Site service and escalates as necessary.
  • Reviews and writes incident postmortems and shares insights with the team.
  • Acts as a Designated Responsible Individual (DRI) and guides other engineers by developing and following the playbook, working on call to monitor system/product/service for degradation, downtime, or interruptions. Alerts stakeholders as to status and initiates actions to restore system/product/service for simple problems and complex problems when appropriate. Responds within Service Level Agreement (SLA) timeframe. Drives efforts to reduce incident volume, looking globally at incidents and providing broad resolutions. Escalates issues to appropriate owners.