This role applies Site Reliability Engineering (SRE) principles to AI services and offers the chance to build an SRE practice from scratch. The engineer will define and implement SLOs, monitoring, incident response, operational readiness reviews, capacity planning, and toil elimination for AI services. Unlike traditional SRE roles, this position also addresses AI-specific failure modes such as model drift and token budget exhaustion. The engineer will have direct authority over whether AI services go live, based on operational readiness reviews. The role does not involve application development, AI model building, or infrastructure provisioning; instead, it ensures the operability and reliability of these components.
Job Type
Full-time
Career Level
Senior