ML Software Engineer

eBay
San Jose, CA
$172,000 - $229,600
Remote

About The Position

The Observability Platform team, part of eBay's core Site Reliability Engineering (SRE) organization, is dedicated to enhancing the reliability, performance, and efficiency of eBay's global platform. Our mission is to build intelligent, scalable tools and solutions that empower our SRE and domain engineering teams to maintain operational excellence. We develop and maintain a suite of advanced, AI-driven systems built on a wealth of operational data. Our real-time anomaly detection platform analyzes high-volume time-series metrics to predict and flag service degradations. We automate troubleshooting with a sophisticated root cause analysis engine that correlates metrics, events, logs, and traces to pinpoint failure origins. We are also pioneering the use of GenAI, building an LLM-based agentic system to automate complex operational tasks and a novel suite of AI-powered explainability tools to clarify the behavior of distributed systems.

Requirements

  • BS/BA or MS in Computer Science or a related field with 7+ years of proven experience in Software Engineering or Machine Learning.
  • Strong hands-on experience applying machine learning to operational data, including time-series analysis, anomaly detection, or NLP on system logs and traces.
  • Proven experience with AI/GenAI, including hands-on work with Large Language Models (LLMs), prompt engineering, and building agentic systems or RAG (Retrieval-Augmented Generation) applications.
  • Strong programming skills in languages like Python or Go.
  • Hands-on experience with the operational side of machine learning, including model deployment, monitoring, and lifecycle management using tools like Kubernetes and Docker.
  • Experience with ML frameworks like PyTorch, TensorFlow, or scikit-learn.
  • Strong understanding of SQL and NoSQL databases.

Nice To Haves

  • Experience with time-series or analytical databases (e.g., Prometheus, ClickHouse) is a significant plus.

Responsibilities

  • Advance our anomaly detection capabilities by developing and productionizing time-series models (both statistical and neural-network-based) on real-time metric streams.
  • Enhance our automated root cause analysis engine by applying advanced correlation techniques and machine learning models to pinpoint the source of system failures from metrics, events, logs, and traces.
  • Develop innovative GenAI/LLM-powered tools and drive the evolution of our existing solutions, such as an LLM-based agent for automating operations and a suite of AI-powered explainers for diagnosing complex system behaviors.
  • Design and develop scalable data pipelines to process massive volumes of observability data that fuel all our ML/AI systems.
  • Collaborate closely with SREs, platform architects, and domain engineering teams to understand their operational challenges and deliver solutions that improve system reliability and reduce mean time to resolution (MTTR).
  • Own the entire software and model lifecycle, from initial design and prototyping to development, testing, deployment, and operational maintenance.

Benefits

  • 401(k) eligibility
  • Various paid time-off benefits, such as PTO and parental leave