Lightning AI is seeking engineers who understand the complexities of running machine learning workloads at scale. The role blends ML systems, cloud infrastructure, Kubernetes, and customer interaction. The engineer will support teams that train models, deploy inference systems, and scale GPU workloads in production. This is not a traditional support role: the engineer acts as a technical partner to ML teams, helping diagnose failures, improve reliability, and work through complex distributed systems challenges. Issues range from Kubernetes scheduling and GPU orchestration to distributed PyTorch failures, inference latency, networking bottlenecks, storage performance, and overall platform reliability. The position offers exposure to real-world AI workloads across industries and the opportunity to influence the infrastructure that powers future ML applications.
Job Type: Full-time
Career Level: Mid Level
Education Level: No Education Listed