Senior Principal MLOps Engineer, AI Inference

Red Hat, Boston, MA
Posted 75 days ago
$189,600 - $312,730

About The Position

At Red Hat, we believe the future of AI is open, and we are on a mission to bring the power of open-source LLMs and vLLM to every enterprise. The Red Hat Inference team accelerates AI for the enterprise and brings operational simplicity to GenAI deployments. As leading developers and maintainers of the vLLM project, and inventors of state-of-the-art techniques for model compression, our team provides a stable platform for enterprises to build, optimize, and scale LLM deployments.

We are seeking an experienced MLOps engineer to work closely with our product and research teams to scale state-of-the-art deep learning products and software. In this role, you will manage training and deployment pipelines, create DevOps and CI/CD infrastructure, and scale our current technology stack. Your primary responsibilities will be to build and release the Red Hat AI Inference runtimes, continuously improve the processes and tooling used by the DevOps team, and find opportunities to automate procedures and tasks.

If you want to help solve challenging technical problems at the forefront of deep learning, this is the role for you. Join us in shaping the future of AI!

Requirements

  • 2+ years of experience in MLOps, DevOps, automation, and modern software deployment practices
  • Strong experience with Git, GitHub Actions (including self-hosted runners), Terraform, Jenkins, Ansible, and common automation and monitoring technologies
  • Highly experienced administering Kubernetes/OpenShift
  • Familiar with Agile development methodology
  • Experience with Cloud Computing using at least one of the following Cloud infrastructures: AWS, GCP, Azure, or IBM Cloud
  • Solid programming skills especially in Python
  • Solid troubleshooting skills
  • Ability to interact comfortably with the other members of a large, geographically dispersed team
  • Experience maintaining an infrastructure and ensuring stability
  • Familiarity with contributing to vLLM CI is a big plus
  • While a Bachelor's degree or higher in computer science, mathematics, or a related discipline is valued, we prioritize technical prowess, initiative, problem-solving, and practical experience

Responsibilities

  • Collaborate with research and product development teams to scale machine learning products for internal and external applications
  • Create and manage model training and deployment pipelines
  • Actively contribute to managing and releasing upstream and midstream product builds
  • Test to ensure correctness, responsiveness, and efficiency
  • Troubleshoot, debug, and upgrade development and test pipelines
  • Identify and deploy cybersecurity measures through continuous vulnerability assessment and risk management
  • Collaborate with a cross-functional team about market requirements and best practices
  • Keep abreast of the latest technologies and standards in the field

Benefits

  • Comprehensive medical, dental, and vision coverage
  • Flexible Spending Account - healthcare and dependent care
  • Health Savings Account - high deductible medical plan
  • Retirement 401(k) with employer match
  • Paid time off and holidays
  • Paid parental leave plans for all new parents
  • Leave benefits including disability, paid family medical leave, and paid military leave
  • Additional benefits including employee stock purchase plan, family planning reimbursement, tuition reimbursement, transportation expense account, employee assistance program, and more!

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Industry: Professional, Scientific, and Technical Services
  • Education Level: Bachelor's degree
  • Number of Employees: 5,001-10,000 employees
