Platform Support Engineer

Lightning AI•San Francisco, CA

38d•Hybrid

About The Position

Lightning AI is seeking engineers who understand the complexities of running machine learning workloads at scale. This role is a blend of ML systems, cloud infrastructure, Kubernetes, and customer interaction. The engineer will support teams training models, deploying inference systems, and scaling GPU workloads in production. This is not a traditional support role; instead, it involves acting as a technical partner to ML teams, assisting with failure diagnosis, improving reliability, and guiding customers through intricate distributed systems challenges. The issues encountered can range from Kubernetes scheduling and GPU orchestration to distributed PyTorch failures, inference latency, networking bottlenecks, storage performance, and overall platform reliability. This position offers exposure to diverse real-world AI workloads across various industries and the opportunity to influence the infrastructure powering future ML applications.

Requirements

Strong software engineering and systems troubleshooting background.
Experience with Kubernetes and containerized environments.
Linux systems knowledge, including networking, storage, process management, and performance tuning.
Experience with cloud infrastructure and distributed systems.
Experience with observability and debugging tools such as Prometheus, Grafana, or OpenTelemetry.
Hands-on experience operating machine learning workloads in production or research environments.
Experience with distributed ML systems and tooling such as PyTorch, CUDA, or NCCL.
Familiarity with GPU infrastructure and orchestration.
Experience troubleshooting performance, reliability, or scaling issues in ML infrastructure.
Understanding of the operational challenges involved in running ML systems at scale.
Strong communication skills and ability to work directly with highly technical customers and engineering teams.
Comfortable operating in fast-moving, highly ambiguous environments.
Enjoys solving complex technical problems collaboratively.

Nice To Haves

Experience with large-scale model training or distributed inference systems.
Familiarity with Ray, Kubeflow, Slurm, or similar distributed scheduling platforms.
Experience with InfiniBand, RDMA, or high-performance networking.
Experience operating bare-metal infrastructure.
Familiarity with storage systems commonly used in ML environments.
Experience working at an AI infrastructure, cloud, MLOps, or developer tooling company.
Contributions to platform engineering, developer infrastructure, or operational tooling projects.
Experience writing automation, tooling, or scripts in Python or similar languages.

Responsibilities

Partner directly with customer engineering teams running training and inference workloads in production.
Help customers diagnose and resolve complex distributed systems and ML infrastructure issues.
Act as a technical advisor during high-impact incidents and platform degradation events.
Translate infrastructure-level issues into actionable guidance for ML engineers.
Build credibility with customers through strong technical reasoning and clear communication.
Investigate failures involving distributed training, Kubernetes orchestration, GPU allocation, networking, and storage systems.
Troubleshoot issues involving PyTorch, CUDA, NCCL, and inference serving.
Analyze logs, metrics, traces, and system behavior to isolate root causes.
Debug containerized workloads running across Kubernetes and bare-metal GPU environments.
Support customers scaling workloads across multi-node GPU systems.
Diagnose performance bottlenecks involving compute, memory, networking, or storage.
Identify recurring patterns across customer issues and drive long-term reliability improvements.
Contribute to post-incident reviews and operational improvements.
Build internal tooling, automation, documentation, and runbooks.
Partner closely with infrastructure, networking, and platform engineering teams.
Help improve observability, operational visibility, and troubleshooting workflows.
Improve the customer experience through better processes and technical guidance.