Principal Software Engineer, CoreAI

Microsoft, Redmond, WA

About The Position

The GenAI Infrastructure and Solutions team builds large-scale GenAI training infrastructure and LLM-based solutions and tools. We provide the infrastructure that teams in CoreAI and other Microsoft groups use to fine-tune LLMs and serve agentic workloads for their own scenarios. As a Principal Software Engineer, you will work on the infrastructure and tools that support large-scale model fine-tuning, evaluation, and inference.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees, we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond. In alignment with our Microsoft values, we are committed to cultivating an inclusive work environment for all employees to positively impact our culture every day.

Requirements

  • Bachelor's degree in Computer Science or a related technical field and 8+ years of technical engineering experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python; or equivalent experience.

Nice To Haves

  • 6+ years of experience designing, developing, and shipping high-quality software.
  • 3+ years of experience with distributed systems and cloud-based infrastructure.
  • 2+ years of experience with containerization tools (e.g., Docker, Kubernetes).
  • 2+ years of experience with DevOps practices (CI/CD, automated testing, deployment, etc.).
  • Passionate and self-motivated, with a strong ability to learn independently, enter new domains, and manage through uncertainty in an innovative team environment.
  • Familiarity with virtualization technology.
  • Familiarity with production ML systems and concepts like model serving, caching, batching, and monitoring.

Responsibilities

  • Lead the collaboration with engineers and researchers to build and optimize training infrastructure and tools for LLMs, SLMs, multimodal, and code-specific models.
  • Design, build and improve services with high scalability and reliability.
  • Design and implement services that serve production traffic and meet security and privacy requirements.
  • Lead the efforts to deliver and improve engineering systems and practices to ensure service quality in complex cloud environments.
  • Contribute to the deployment and monitoring of services in production environments.