Lightning AI is seeking engineers who understand the challenges of running machine learning workloads at scale. This role is a blend of ML systems, cloud infrastructure, Kubernetes, and customer interaction. The successful candidate will support engineers with model training, inference system deployment, and scaling GPU workloads in production. This is not a traditional support role; it involves being a technical partner to ML teams, helping them diagnose failures, improve reliability, and navigate complex distributed systems issues. Problems can range from Kubernetes scheduling and GPU orchestration to distributed PyTorch failures, inference latency, networking bottlenecks, storage performance, and platform reliability. The role offers exposure to diverse AI workloads and the opportunity to shape the infrastructure for future ML applications.
Job Type: Full-time
Career Level: Senior
Education Level: No Education Listed