About The Position

As a Software Development Engineer on the Neuron Runtime Team, you will work alongside a team of engineers to develop and maintain high-performance runtime libraries and drivers for machine learning applications and AI accelerators. You will work on the design, development, and deployment of the Neuron Runtime and other Neuron components. The profiler plays a crucial role for internal and external customers in optimizing AI workloads across hardware platforms such as Trainium and Inferentia devices: it provides deep insights into performance bottlenecks and system behavior, which helps improve the performance of ML kernels and ML frameworks.

In this role, you will manage the full development life cycle of the Neuron Runtime, ensuring scalability, reliability, and usability. You will collaborate with cross-functional teams to ensure that our C++ compiler generates the key information customers need to understand and optimize the performance of our custom hardware. Additionally, you will drive innovations that allow the profiler to support multiple frameworks, such as PyTorch, JAX, and XLA.

A successful candidate will have experience in architecting, building, and operating distributed systems with a focus on high availability and fault tolerance; hands-on experience with AWS services (e.g., EC2, ECS, CloudWatch, S3, Lambda) in production environments; and a track record of owning services end-to-end, including deployment, monitoring, alarming, on-call, and post-incident review. You will work with executive leadership and other senior management and technical leaders to define product directions and deliver them to customers. We build massive-scale distributed training and inference solutions. This organization builds the full stack of software, servers, and chips to accelerate these workloads at the highest scale.

Requirements

  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship design or architecture (design patterns, reliability and scaling) of new and existing systems experience
  • Experience programming with at least one software programming language
  • Experience in architecting, building, and operating distributed systems with a focus on high availability and fault tolerance
  • Hands-on experience with AWS services (e.g., EC2, ECS, CloudWatch, S3, Lambda) in production environments and a track record of owning services end-to-end, including deployment, monitoring, alarming, on-call, and post-incident review

Nice To Haves

  • 3+ years of full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations experience
  • Bachelor's degree in computer science or equivalent

Responsibilities

  • Develop and maintain high-performance runtime libraries and drivers for machine learning applications and AI accelerators.
  • Work on design, development, and deployment of Neuron Runtime and other Neuron components.
  • Manage the full development life cycle of the Neuron Runtime, ensuring scalability, reliability, and usability.
  • Collaborate with cross-functional teams to ensure that our C++ compiler generates the key information customers need to understand and optimize the performance of our custom hardware.
  • Drive innovations that allow the profiler to support multiple frameworks, such as PyTorch, JAX, and XLA.

Benefits

  • health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance with an option for Supplemental Life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage)
  • 401(k) matching
  • paid time off
  • parental leave