About The Position

Netflix is one of the world's leading entertainment services, with over 300 million paid memberships in over 190 countries enjoying TV series, films, and games across a wide variety of genres and languages. Members can play, pause, and resume watching as much as they want, anytime, anywhere, and can change their plans at any time.

This role is part of the Exploration and Troubleshooting team, a key part of our Observability engineering group, the “eyes and ears” of Netflix engineering. Observability engineering provides the platform and suite of products that allow Netflix engineers to understand how their services behave in real time, detect anomalies in system health, and troubleshoot and remediate problems. Our platform processes billions of data points in real time every minute. The success of our platform and products is crucial to Netflix's success and our ability to operate the Netflix cloud.

Requirements

  • Industry Experience: You have 8+ years of software engineering experience.
  • GenAI stack: You have a strong interest in and experience with the latest GenAI stack (LLMs, RAG, agents). You are familiar with workflow engines like Temporal, AWS Lambda, AWS AgentCore, and LangGraph, and with AI observability systems like Braintrust.
  • Distributed systems: You have experience in building and operating scalable, observable, fault-tolerant, distributed systems. You have experience with AWS services.
  • Tech stack: You are proficient in Java, gRPC, and Python. Familiarity with Scala is a plus.
  • Full lifecycle engineer: You are knowledgeable about and are willing to own all areas of the software lifecycle: design, development, test, deploy, operate, and support.

Nice To Haves

  • Observability Experience: You have extensive knowledge of, or have built, observability products for logs, metrics, and traces.

Responsibilities

  • Design and implement the distributed backbone for Netflix Observability's agentic AI-driven analysis, inference, and orchestration systems.
  • Develop robust ingestion and correlation layers that unify signals from logs, metrics, traces, and alerts across cloud and on-prem environments.
  • Optimize and extend workflows for real-time, actionable recommendations and root cause analysis (RCA) automation.
  • Collaborate cross-functionally with Observability, SRE, and Platform teams to scale AutoSRE as a self-serve, extensible AI agent for Netflix engineering.

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Education Level: No Education Listed
  • Number of Employees: 5,001-10,000 employees
