Software Engineer-3

eBay, San Jose, CA

About The Position

At eBay, we're more than a global ecommerce leader; we're changing the way the world shops and sells. Our platform empowers millions of buyers and sellers in more than 190 markets around the world. We're committed to pushing boundaries and leaving our mark as we reinvent the future of ecommerce for enthusiasts. Our customers are our compass, authenticity thrives, bold ideas are welcome, and everyone can bring their unique selves to work, every day. We're in this together, sustaining the future of our customers, our company, and our planet. Join a team of passionate thinkers, innovators, and dreamers, and help us connect people and build communities to create economic opportunity for all.

About the team and role: The Observability Platform team, part of eBay's core Site Reliability Engineering (SRE) organization, is dedicated to enhancing the reliability, performance, and efficiency of eBay's global platform. Our mission is to build intelligent, scalable tools and solutions that empower our SRE and domain engineering teams to maintain operational excellence.

We develop and maintain a suite of advanced, AI-driven systems by leveraging a wealth of operational data. Our real-time anomaly detection platform analyzes high-volume time-series metrics to predict and flag service degradations. We automate troubleshooting with a sophisticated root cause analysis engine that correlates metrics, events, logs, and traces to pinpoint failure origins. Furthermore, we are pioneering the use of GenAI to build an LLM-based agentic system to automate complex operational tasks, and a novel suite of AI-powered explainability tools to clarify the behavior of distributed systems.

Requirements

  • MS in Computer Science or a related field with 4+ years of relevant experience (or BS/BA with 6+ years) in Software Engineering or Machine Learning.
  • Strong hands-on experience applying machine learning to operational data, including time-series analysis, anomaly detection, or NLP on system logs and traces.
  • Proven experience with AI/GenAI, including hands-on work with Large Language Models (LLMs), prompt engineering, and building agentic systems or RAG (Retrieval-Augmented Generation) applications.
  • Strong programming skills in languages like Python or Go.
  • Hands-on experience with the operational side of machine learning, including model deployment, monitoring, and lifecycle management using tools like Kubernetes and Docker.
  • Experience with ML frameworks like PyTorch, TensorFlow, or scikit-learn.
  • Strong understanding of SQL and NoSQL databases.

Nice To Haves

  • Experience with time-series or analytical databases (e.g., Prometheus, ClickHouse) is a significant plus.
  • Experience with core components of modern observability stacks (e.g., metrics collection/storage like Prometheus; logging like Loki; tracing like Jaeger/Tempo; visualization like Grafana) and container orchestration platforms like Kubernetes is a significant plus.

Responsibilities

  • Advance our anomaly detection capabilities, developing and productionalizing time-series models (both statistical and NN-based) on real-time metric streams.
  • Enhance our automated root cause analysis engine by applying advanced correlation techniques and machine learning models to pinpoint the source of system failures from metrics, events, logs, and traces.
  • Develop innovative GenAI/LLM-powered tools and drive the evolution of our existing solutions, such as an LLM-based agent for automating operations and a suite of AI-powered explainers for diagnosing complex system behaviors.
  • Design and develop scalable data pipelines to process massive volumes of observability data that fuel all our ML/AI systems.
  • Collaborate closely with SREs, platform architects, and domain engineering teams to understand their operational challenges and deliver solutions that improve system reliability and reduce mean time to resolution (MTTR).
  • Own the entire software and model lifecycle, from initial design and prototyping to development, testing, deployment, and operational maintenance.

Benefits

  • a target bonus
  • restricted stock units (as applicable)
  • a full range of medical, financial, and/or other benefits (including 401(k) eligibility and various paid time off benefits, such as PTO and parental leave)