About The Position

As a Principal Software Engineering Manager - AI Frameworks, you will lead and grow a group of engineers working across multiple layers of the AI software serving stack, including fundamental abstractions, runtimes, libraries, and application programming interfaces (APIs). You will be responsible for setting technical direction, prioritizing investments, and ensuring the team delivers high-impact performance improvements that enable large-scale model training and inference. In this role, you will guide the team’s work on benchmarking OpenAI and other large language models (LLMs) across GPUs and Microsoft hardware, driving performance optimization, monitoring regressions, and accelerating time-to-deployment. You will partner closely with researchers, product teams, and platform owners to translate performance insights into production-ready improvements that reduce hardware footprint and support Microsoft Azure’s capex efficiency goals.

Requirements

  • Bachelor’s Degree in Computer Science or related technical field AND 6+ years of technical engineering experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience.
  • Strong technical foundation in software engineering principles, computer architecture, GPU architecture, and hardware acceleration for neural networks, with the ability to guide teams working in these areas.
  • Experience leading teams responsible for end-to-end performance analysis and optimization of LLMs, AI systems, or HPC workloads, including use of GPU profiling and performance analysis tools.
  • Demonstrated ability to lead cross-team initiatives, align stakeholders, and translate research or platform capabilities into scalable, production-ready solutions.
  • Proven people leadership skills, including hiring, coaching, performance management, and career development, with a track record of building high-performing, inclusive teams.
  • Exposure to AI / ML infrastructure, including DNN or LLM training and/or inference systems, and experience with at least one modern deep learning framework (e.g., PyTorch, TensorFlow, ONNX Runtime).
  • Familiarity with GPU software stacks and acceleration technologies such as CUDA, ROCm, Triton, or equivalent, sufficient to guide technical direction and evaluate tradeoffs.

Nice To Haves

  • Master’s Degree in Computer Science or related technical field AND 10+ years of software engineering experience, including 6+ years in engineering management, OR Bachelor’s Degree in Computer Science or related technical field AND 12+ years of software engineering experience, including 6+ years in engineering management, or equivalent experience.

Responsibilities

  • Lead and develop a team of engineers working across multiple layers of the AI software stack to enable large-scale training and inference.
  • Set technical vision and execution strategy for model performance benchmarking, optimization, and deployment across GPUs and Microsoft hardware.
  • Drive performance outcomes by prioritizing and overseeing efforts to benchmark, profile, debug, and optimize training and inference workloads.
  • Own performance health by establishing mechanisms to monitor regressions, measure impact, and continuously improve time-to-deploy and hardware efficiency.
  • Partner cross-functionally with research, product, infrastructure, and hardware teams to deliver scalable, production-ready AI performance improvements.
  • Balance short-term delivery and long-term investments, ensuring the team’s work aligns with organizational goals, platform roadmaps, and Azure capex objectives.
  • Build a strong engineering culture through coaching, feedback, hiring, and career development, enabling the team to operate with increasing autonomy and impact.