About The Position

Neuron Containers connects the Neuron SDK to compute platforms. The team enables customers to run training and inference workloads on Neuron at scale — with reliable device allocation, fault tolerance, auto-scaling, and native observability. The team owns Neuron integration with Kubernetes, ECS, and Slurm.

Requirements

  • 5+ years of non-internship professional software development experience
  • 5+ years of programming experience in at least one programming language
  • 5+ years of experience leading design or architecture (design patterns, reliability, and scaling) of new and existing systems
  • 5+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Experience as a mentor, tech lead, or leader of an engineering team
  • 7+ years of software development experience
  • 4+ years of distributed systems experience, or Bachelor's degree in engineering, technology, computer science, machine learning, robotics, operations research, statistics, mathematics or equivalent quantitative field
  • Experience in large-scale IT deployment or programming
  • Experience leading engineering design projects and interacting with cross-functional teams
  • Proficiency in Go or similar systems languages

Nice To Haves

  • Bachelor's degree in computer science or equivalent
  • Experience with Kubernetes architecture: device plugins, schedulers, controllers, and DRA drivers
  • Hands-on experience with Helm, Prometheus, and Kubernetes operator frameworks
  • Familiarity with ML training/inference infrastructure and container image pipelines
  • Experience with AWS compute services (EC2, EKS, ECS, ECR)
  • Exposure to Deep Learning Containers or Deep Learning AMIs

Responsibilities

  • Lead multi-person projects end-to-end — from design documentation and architecture reviews through to delivery
  • Design container platform integrations — device plugins, DRA drivers, and operator development for ML accelerator resource management
  • Solve scalability challenges — diagnose performance issues across thousand-node customer clusters
  • Simplify systems — deprecate legacy software and reduce complexity in container delivery pipelines
  • Drive operational excellence — own on-call responsibilities, proactively triage test failures, and drive ticket resolution
  • Elevate team quality — deliver insightful code reviews and set the standard through your own contributions
  • Build consensus — align stakeholders on technical direction when solutions are ambiguous or views are discordant

Benefits

  • health insurance (medical, dental, vision, prescription, basic life and AD&D insurance with optional supplemental life plans, EAP, mental health support, medical advice line, flexible spending accounts, adoption and surrogacy reimbursement coverage)
  • 401(k) matching
  • paid time off
  • parental leave