ML Ops Engineer

Tennr, New York, NY
Onsite

About The Position

As the first ML Ops Engineer at Tennr, you’ll play a crucial role in building and iterating on foundational Machine Learning and AI systems. You’ll own the machine learning training and inference pipelines, building them to handle increasing traffic and an expanding product surface as we grow. You will be critical in ensuring our AI-driven healthcare platform is powered by robust, scalable, and efficiently deployed models. Our Machine Learning team owns and develops multiple in-house, proprietary VLMs, LLMs, and other models that are purpose-built for the ambitious problems we are solving in the healthcare space. This is not a role where you are repackaging and wrapping old innovations, but an opportunity to be on the cutting edge of experimentation and productization of net-new capabilities. You’ll make impactful contributions and influence fundamental elements of our ML and data systems, expanding Tennr’s ability to rapidly iterate and solve critical problems for patients and providers.

Requirements

  • 5+ years of experience in ML model deployment, infrastructure, and scaling in production environments
  • Strong software engineering fundamentals, with proficiency in Python and TypeScript
  • Experience in software design and architecture for highly available ML systems for use cases like inference, evaluation, and experimentation
  • Strong knowledge of observability, including logging, metrics, tracing, model performance monitoring, and alerting
  • Experience with distributed systems, reliability, and production incident response
  • Comfortable working in ambiguity with high ownership, moving quickly in a fast-paced startup environment, and proactively driving projects from idea to production

Nice To Haves

  • Experience working with ML CI/CD and common ML frameworks like PyTorch, TensorFlow, etc.
  • Experience working with common inference frameworks like vLLM, TensorRT, Triton, etc.
  • Experience with GPU orchestration, including managing GPU workloads/scheduling, cost management, cluster utilization, etc.
  • Experience with GPU optimization (training/inference) involving CUDA profiling, memory optimization, multi-GPU communication, etc.

Responsibilities

  • Architect, design, and implement ML software systems for deploying and managing models at scale.
  • Develop and maintain infrastructure that supports efficient ML operations, including data pipelines, model evaluations, deployments, and training at scale.
  • Collaborate closely with ML engineers, software engineers, and cross-functional teams to ensure seamless integration of models with data pipelines and products.
  • Troubleshoot production issues and continuously improve systems to enhance performance and efficiency.
  • Create tooling for online and offline evaluation of ML & LLM systems.

Benefits

  • Chelsea office
  • Unlimited PTO
  • 100% paid employee health benefit options
  • Employer-funded 401(k) match
  • Competitive parental leave
  • Free lunch! Plus a pantry full of snacks.