At Red Hat, we believe the future of AI is open, and we are on a mission to bring the power of open-source LLMs and vLLM to every enterprise. The Red Hat Inference team accelerates AI for the enterprise and brings operational simplicity to GenAI deployments. As leading developers and maintainers of the vLLM project and inventors of state-of-the-art model compression techniques, our team provides a stable platform for enterprises to build, optimize, and scale LLM deployments.

We are seeking an experienced MLOps engineer to work closely with our product and research teams to scale state-of-the-art deep learning products and software. In this role, you will manage training and deployment pipelines, create DevOps and CI/CD infrastructure, and scale our current technology stack. Your primary responsibility will be to build and release the Red Hat AI Inference runtimes, continuously improve the processes and tooling used by the DevOps team, and find opportunities to automate procedures and tasks.

If you want to help solve challenging technical problems at the forefront of deep learning, this is the role for you. Join us in shaping the future of AI!
Job Type: Full-time
Career Level: Senior
Number of Employees: 501-1,000