At eBay, we're more than a global ecommerce leader: we're changing the way the world shops and sells. Our platform empowers millions of buyers and sellers in more than 190 markets around the world. We're committed to pushing boundaries and leaving our mark as we reinvent the future of ecommerce for enthusiasts. Our customers are our compass, authenticity thrives, bold ideas are welcome, and everyone can bring their unique selves to work, every day. We're in this together, sustaining the future of our customers, our company, and our planet. Join a team of passionate thinkers, innovators, and dreamers, and help us connect people and build communities to create economic opportunity for all.
About the team and role:
The Observability Platform team, part of eBay's core Site Reliability Engineering (SRE) organization, is dedicated to enhancing the reliability, performance, and efficiency of eBay's global platform. Our mission is to build intelligent, scalable tools and solutions that empower our SRE and domain engineering teams to maintain operational excellence. We develop and maintain a suite of advanced, AI-driven systems by leveraging a wealth of operational data. Our real-time anomaly detection platform analyzes high-volume time-series metrics to predict and flag service degradations. We automate troubleshooting with a sophisticated root cause analysis engine that correlates metrics, events, logs, and traces to pinpoint failure origins. Furthermore, we are pioneering the use of GenAI to build an LLM-based agentic system to automate complex operational tasks, and a novel suite of AI-powered explainability tools to clarify the behavior of distributed systems.
What You Will Accomplish:
Advance our anomaly detection capabilities, developing and productionalizing time-series models (both statistical and NN-based) on real-time metric streams.
Enhance our automated root cause analysis engine by applying advanced correlation techniques and machine learning models to pinpoint the source of system failures from metrics, events, logs, and traces.
Develop innovative GenAI/LLM-powered tools and drive the evolution of our existing solutions, such as an LLM-based agent for automating operations and a suite of AI-powered explainers for diagnosing complex system behaviors.
Design and develop scalable data pipelines to process massive volumes of observability data that fuel all our ML/AI systems.
Collaborate closely with SREs, platform architects, and domain engineering teams to understand their operational challenges and deliver solutions that improve system reliability and reduce mean time to resolution (MTTR).
Own the entire software and model lifecycle, from initial design and prototyping to development, testing, deployment, and operational maintenance.
Job Type
Full-time
Career Level
Mid Level