We are looking for a Senior Site Reliability Engineer who combines deep infrastructure expertise with a forward-thinking approach to AI-driven operations. In this role, you will maintain and improve the reliability, scalability, and performance of our Java-based applications while pioneering the use of large language models (LLMs), agentic workflows, and intelligent automation to transform how we monitor, respond to, and prevent incidents. You will design and build autonomous and semi-autonomous AI agents that consume observability data, triage alerts, generate runbooks, automate incident response steps, and surface actionable insights, reducing toil and shortening mean time to resolution. This is a hands-on engineering role for someone who is equally comfortable tuning a JVM, writing PromQL, and prototyping an agentic pipeline with tool-calling LLMs.
Job Type: Full-time
Career Level: Senior
Education Level: No Education Listed