Senior Software Engineer

Microsoft, Redmond, WA

About The Position

Search is being transformed by AI – join us to build AI-powered search experiences used by hundreds of millions of people worldwide. The Core Search & AI team in Microsoft AI (MAI) develops and operates the foundational systems behind Search, Grounding, and Agentic Search. You will work on large-scale systems that combine web-scale retrieval, advanced ranking models, and real-time inference to deliver relevant, trustworthy, and high-quality AI experiences.

As a Senior Software Engineer in the Core Search & AI team, you will build and operate next-generation AI infrastructure for Search, Grounding, and Agentic Search. You will develop scalable systems for distributed data pipelines, LLM and SLM training (including SFT and RL), high-throughput inference, evaluation frameworks, and observability. You will collaborate with engineering, research, and product teams to deliver reliable, measurable, and high-performing AI solutions, and use data to guide technical decisions, investigate issues, and improve live-site quality. This opportunity will allow you to deepen your expertise in distributed AI infrastructure, gain experience with production-scale AI workloads, and expand your ownership of end-to-end service quality and operational excellence.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Starting January 26, 2026, Microsoft AI (MAI) employees who live within a 50-mile commute of a designated Microsoft office in the U.S. or 25-mile commute of a non-U.S., country-specific location are expected to work from the office at least four days per week. This expectation is subject to local law and may vary by jurisdiction.

Requirements

  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.

Nice To Haves

  • Master's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Experience building high throughput, low latency distributed systems and applications at scale.
  • Experience designing and operating data processing, training workflows, and inference systems for LLMs/SLMs using Azure Machine Learning.
  • Experience optimizing GPU-based serving workloads for performance, efficiency and cost.
  • Experience with machine learning fundamentals.

Responsibilities

  • Collaborates with appropriate stakeholders to define user requirements for a scenario and incorporates stakeholder insights into system design.
  • Drives identification of dependencies and the development of design documents for a product or service with little oversight.
  • Builds, reviews, and maintains high-quality, secure, and performant code, applying best practices in reliability, testability, and maintainability, and using telemetry and debugging tools to validate assumptions and prevent issues before production.
  • Leverages subject-matter expertise of product features and partners with appropriate stakeholders to drive a workgroup's project plans, release plans, and work items.
  • Acts as a Designated Responsible Individual (DRI) and guides other engineers by developing and following the playbook, working on call to monitor the system/product/service for degradation, downtime, or interruptions, alerting stakeholders about status, and initiating actions to restore the system/product/service for simple and complex problems when appropriate.
  • Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale.