Senior Software Engineer, CoreAI

Microsoft, Redmond, WA
4d

About The Position

Architect, design, and develop core AI Infrastructure services written in Go, Rust, Python, C++, and C# and deployed on large-scale Kubernetes clusters to support pre-training and post-training of state-of-the-art LLMs, SLMs, multimodal, and code-specific models. Collaborate closely with engineers, researchers, and external partners to debug, diagnose, and improve the stability of large-scale training runs. Enhance systems and applications to deliver high stability, low latency, strong security, and maintainability in large-scale, complex training environments in Azure and in partner clouds. Provide operational support, technical leadership, and vision while contributing to the deployment, monitoring, and continuous improvement of engineering systems and practices.

Requirements

  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Ability to meet specialized security screening requirements.
  • 3+ years of experience designing, developing, and shipping high-quality software.
  • 2+ years of experience with distributed systems and cloud-based infrastructure.
  • 1+ year of experience with DevOps practices (CI/CD, automated testing, deployment, etc.).
  • 4+ years of software development experience in C#, C++, Python, or similar languages.
  • 2+ years of experience with containerization tools (e.g., Docker, Kubernetes).
  • Knowledge of and hands-on experience with production ML systems, large-scale training infrastructure, NCCL, and CUDA libraries and tools.

Responsibilities

  • Architect, design, and develop core AI Infrastructure services.
  • Collaborate with engineers, researchers and external partners to debug, diagnose, and improve stability of large-scale training runs.
  • Enhance systems and applications to deliver high stability, low latency, strong security, and maintainability in large-scale, complex training environments.
  • Provide operational support, technical leadership, and vision while contributing to the deployment, monitoring, and continuous improvement of engineering systems and practices.