Our team manages multiple functions across Tesla, including DevOps, MLOps, Cloud Infrastructure (AWS, Azure, GCP), and Factory SRE. Continued development and automation of deployment, monitoring, self-healing, and alerting processes is imperative to the success of our engineering groups. As a Site Reliability Engineer, you will be responsible for maintaining and improving our platform to ensure our cross-functional teams have the necessary tools and resources to be productive.

Mature our Machine Learning Operations Platform and advocate best practices to MLOps engineers
Design and implement scalable, automated workflows for the complete ML lifecycle
Maintain Kubernetes-based infrastructure for model training, deployment, and monitoring
Develop solutions for workload orchestration and time-slicing using tools like Flyte and Ray
Collaborate with engineers to build and maintain robust pipelines for training and inference workflows
Develop Infrastructure-as-Code (IaC) solutions for deploying and managing cloud and on-prem ML environments
Design and develop intuitive, user-friendly self-service portals using React so that data scientists and engineers can manage ML pipelines, monitor models, and access resources seamlessly
Package applications as Helm charts and deploy them via ArgoCD
Participate in a 24x7 on-call rotation
Job Type
Full-time
Career Level
Mid Level
Number of Employees
5,001-10,000 employees