This is your chance to build a reliability practice from the ground up and establish the engineering standards, including SLOs, error budgets, and observability, that will protect our platform as we scale for enterprise customers and expand our AI capabilities. You’ll have the autonomy to set the strategy and the trust to execute it, ensuring that our AI workloads (Evals, RAG, and LLM processing) meet the highest reliability standards. If you are a proactive problem solver who treats toil as an engineering challenge and wants the agency to decide which technologies to adopt and when, you will find this to be a career-defining role.

As a Staff or Senior Staff SRE, you’ll start by partnering with the engineers currently managing reliability to move the organization from reactive firefighting to a proactive, disciplined reliability practice. You will lead the deliberate evolution of our infrastructure, recognizing the inflection point for new tooling and leading migrations away from manual scripts and templates only once the replacements have earned their keep. Whether you are architecting incident response structures or solving novel reliability problems for AI agents, your work will act as a multiplier for the entire engineering team.

By bringing a consulting mindset to every challenge, you’ll propose technical trade-offs based on evidence rather than reflex, ensuring our roadmap for multi-region or service mesh adoption is built for future needs. You won’t just be handed tasks; you will own the strategy for production-readiness and deploy safety, building the organizational trust needed to make reliability a core differentiator of our product.
Job Type: Full-time
Career Level: Senior
Education Level: No Education Listed
Number of Employees: 101-250 employees