Infrastructure Operations Engineer

Lightning AI•San Francisco, NY

Hybrid

About The Position

Lightning AI is seeking an experienced Infrastructure Operations Engineer to help scale and operate our next-generation AI infrastructure platform. Our InfraOps team sits at the center of reliability, automation, and operational scale for GPU infrastructure. This team owns break/fix operations, incident response, customer provisioning, observability, and the automation systems that keep complex infrastructure running efficiently.

In this role, you’ll work hands-on with large-scale GPU environments, Linux systems, bare metal infrastructure, provisioning workflows, and platform reliability. You’ll partner closely with Infrastructure Engineering, Network Operations, and Software Platform teams to troubleshoot issues, improve operational efficiency, and build automation that reduces manual toil over time.

We’re flexible on location for this team. This role can work hybrid out of one of our US-based hubs (Seattle, NYC, or SF) or fully remote within the U.S., with occasional company and team offsites. We are not able to provide visa sponsorship for this position at this time.

Requirements

8+ years working with Linux as a server / hosting platform; extra points for Ubuntu experience.
5+ years of experience with AWS.
2+ years of experience with Kubernetes and strong container fundamentals.
2+ years of experience with Terraform and Ansible.
2+ years of experience managing network-attached storage (via NFS, Ceph, or other protocols).
Experience with monitoring systems (Prometheus, ELK stack).
Familiarity with the GitOps workflow.
Software development experience using Python, Go, Bash, or other languages to automate work and connect systems and APIs.
Deep networking fundamentals; extra points for experience with datacenter-level networks, 400Gb Ethernet, and InfiniBand.
Experience building and delivering complex systems.
Effective at navigating tradeoffs between design, risk, cost, and outcomes.
Comfortable with navigating ambiguity.
Strong written and oral communication.

Nice To Haves

Experience with bare metal hardware troubleshooting and provisioning; extra points for working with Dell hardware.
Experience with GPU servers, either in bare metal form or under virtualization.
Deep experience with network switches, routers, and firewalls, particularly SONiC switches, Palo Alto firewalls, and Juniper Networks equipment.
Experience with VAST storage systems.

Responsibilities

Design, build, and roll out new platforms and patterns to minimize incidents and enable customer-facing and internal features.
Deploy updates and improvements to support both Voltage Park’s internal and end customer use cases.
Collaborate with colleagues in Infrastructure Engineering, Network Operations, Customer Success, and Software and Platform Development teams.
Participate in the on-call rotation, which is evenly distributed across all team members in a primary / secondary pattern: you serve as primary, then move to the secondary position.

Benefits

Comprehensive medical, dental and vision coverage (U.S.); Private medical and dental insurance (U.K.)
Retirement and financial wellness support (U.S.); Pension contribution (U.K.)
Generous paid time off, plus holidays
Paid parental leave
Professional development support
Wellness and work-from-home stipends
Flexible work environment
Discretionary bonus
Meaningful equity component
