Senior Machine Learning Ops Engineer, GenAI

Rivian and VW Group Technology
Palo Alto, CA

About The Position

Rivian and Volkswagen Group Technologies is a joint venture between two industry leaders with a clear vision for automotive’s next chapter. From operating systems to zonal controllers to cloud and connectivity solutions, we’re addressing the challenges of electric vehicles through technology that will set the standards for software-defined vehicles around the world. The road to the future is uncharted. By combining our expertise across connectivity, AI, security and more, we’ll map a new way forward. Working together, we’ll create a future that’s more connected, more intelligent, more sustainable for everyone.

As an ML Ops Engineer, you will be instrumental in building and maintaining a scalable training and inference platform using both Databricks and open-source technologies. Your role will focus on managing the ML/AI model life cycles in production, including running Large Language Models (LLMs) on bare metal GPUs. You will work with distributed training frameworks and cloud technologies to ensure robust and efficient ML operations.

Requirements

  • Educational Background: Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
  • Proven Experience: 5+ years of experience in ML Ops, infrastructure, or related fields, with a focus on managing ML/AI models in production.
  • Technical Expertise: Proficiency in distributed training frameworks (Torch Distributed, Ray) and ML frameworks (Kubeflow, MLflow, Argent, Weights & Biases).
  • Cloud Proficiency: Strong experience with cloud technologies, including Kubernetes and major cloud providers (AWS, GCP, Azure).
  • Programming Skills: Expertise in programming languages such as Python and familiarity with ML libraries and tools.
  • Problem-Solving Skills: Strong analytical and problem-solving skills, with the ability to troubleshoot complex ML infrastructure issues.
  • Collaborative Mindset: Excellent communication and teamwork skills, with the ability to work effectively in a cross-functional team environment.
  • Passion for Innovation: A keen interest in exploring and applying the latest advancements in ML Ops and infrastructure to drive innovation.

Responsibilities

  • Develop Scalable ML Infrastructure: Design and implement a scalable training and inference platform using Databricks and open-source technologies to support ML/AI solutions.
  • Manage Model Life Cycles: Oversee the end-to-end life cycle of ML/AI models in production, ensuring efficient deployment, monitoring, and maintenance.
  • Run LLMs on Bare Metal GPUs: Optimize and manage the execution of Large Language Models on bare metal GPUs to enhance performance and scalability.
  • Utilize Distributed Training Frameworks: Leverage distributed training frameworks such as Torch Distributed and Ray to improve training efficiency and model performance.
  • Implement ML Frameworks: Work with frameworks like Kubeflow, MLflow, Argent, and Weights & Biases to streamline ML operations and model management.
  • Leverage Cloud Technologies: Use Kubernetes and cloud platforms such as AWS, GCP, and Azure to build and manage scalable ML infrastructure.
  • Collaborate with Cross-Functional Teams: Work closely with data scientists, software engineers, and other stakeholders to integrate ML solutions into existing systems and workflows.
  • Establish Best Practices: Define and implement best practices for ML Ops, ensuring scalability, reliability, and maintainability of ML solutions.
  • Stay Informed on Industry Trends: Continuously research and incorporate emerging trends and technologies in ML Ops and infrastructure to enhance our capabilities.

Benefits

  • Rivian and Volkswagen Group Technologies provides robust medical/Rx, dental and vision insurance packages for full-time employees, their spouse or domestic partner, and children up to age 26. Coverage is effective on the first day of employment, and Rivian covers most of the premiums.