Senior/Staff Machine Learning Systems Engineer

Abridge, San Francisco, CA

About The Position

As an AI Infrastructure Engineer at Abridge, you'll play a pivotal role in building and optimizing the core infrastructure that powers our machine learning models. Your work will be instrumental in enhancing the scalability, efficiency, and performance of our AI-driven solutions. You will work with our Infrastructure and Research teams to build, deploy, optimize, and orchestrate our AI models.

Requirements

  • Strong experience in building and deploying machine learning models in production environments
  • Deep understanding of container orchestration and distributed systems architecture
  • Expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
  • Experience developing APIs and managing distributed systems for both batch and real-time workloads
  • Excellent communication skills, with the ability to interface between research and product engineering

Nice To Haves

  • Expertise with model serving frameworks such as NVIDIA Triton Inference Server, vLLM, and TensorRT-LLM
  • Expertise with ML toolchains such as PyTorch and TensorFlow, or with distributed training and inference libraries
  • Familiarity with GPU cluster management and CUDA optimization
  • Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices
  • Experience with container registries, image optimization, and multi-stage builds for ML workloads
  • Experience orchestrating ASR models or LLMs to build GenAI applications

Responsibilities

  • Design, deploy and maintain scalable Kubernetes clusters for AI model inference and training
  • Develop, optimize, and maintain ML model serving and training infrastructure, ensuring high performance and low latency
  • Collaborate with ML and product teams to scale backend infrastructure for AI-driven products, focusing on model deployment, throughput optimization, and compute efficiency
  • Optimize compute-heavy workflows and enhance GPU utilization for ML workloads
  • Build a robust model API orchestration system
  • Collaborate with leadership to define and implement strategies for scaling infrastructure as the company grows, ensuring long-term efficiency and performance

Benefits

  • Generous Time Off: 13 paid holidays, flexible PTO for salaried employees, and accrued time off for hourly employees
  • Comprehensive Health Plans: Medical, Dental, and Vision plans for all full-time employees. Abridge covers 100% of the premium for you and 75% for dependents. If you choose an HSA-eligible plan, Abridge also makes monthly contributions to your HSA
  • Paid Parental Leave: 16 weeks paid parental leave for all full-time employees
  • 401k and Matching: Contribution matching to help invest in your future
  • Pre-tax Benefits: Access to Flexible Spending Accounts (FSA) and Commuter Benefits
  • Learning and Development Budget: Yearly contributions for coaching, courses, workshops, conferences, and more
  • Sabbatical Leave: 30 days of paid Sabbatical Leave after 5 years of employment
  • Compensation and Equity: Competitive compensation and equity grants for full-time employees