About The Position

Airbnb was born in 2007 when two hosts welcomed three guests to their San Francisco home, and has since grown to over 5 million hosts who have welcomed over 2 billion guest arrivals in almost every country across the globe. Every day, hosts offer unique stays and experiences that make it possible for guests to connect with communities in a more authentic way.

The Community You Will Join

Viaduct is a unified data access layer connecting most of Airbnb’s online data. More than 70% of Airbnb’s API traffic flows through the Viaduct platform. Residing at the center of Airbnb’s tech stack between the user-facing products and backend infrastructure, Viaduct provides a global schema and query system through a GraphQL interface.

Our team mission: “Empower app developers at Airbnb by delivering a seamless and efficient developer experience. We strive to maximize productivity and spark creativity through simplified APIs, improved performance, and the cultivation of tenant team autonomy.”

The Viaduct team is highly tenured and experienced, and sets the best practices and next-generation architecture for Airbnb. As part of the Application Platform pillar of Infrastructure, we work closely with partner infra teams (Build Infra, Service Platform, CI/CD, Reliability, Observability, and Developer Platform, to name a few) as well as product engineers. We treat our platform as a product and follow the principles of good Platform Engineering.

Viaduct has been running in production for over six years, and the team has gained a lot of experience in operating a GraphQL platform at scale. These insights informed the major rewrite, called Viaduct Modern, which we’re in the process of launching and continue to evolve. You will join the effort to create the best developer experience for the hundreds of engineers at Airbnb using our revolutionary GraphQL platform. Viaduct has been released as an Open Source project. Your contributions to the Viaduct platform will serve not only Airbnb-internal developers, but also any member of the Open Source community who chooses to adopt Viaduct.

The Difference You Will Make

  • Drive platform reliability and operational excellence by designing and implementing deployment pipelines, SLO frameworks, observability tooling, performance improvements, and AI-enabled incident response automation that help maintain Viaduct's 99.99% uptime target across Airbnb's critical API traffic.
  • Contribute to runtime resiliency initiatives including resource attribution, performance regression testing, and proactive monitoring to ensure the multi-tenant GraphQL platform scales efficiently and degrades gracefully under load.
  • Architect and deliver AI-powered operational tooling that accelerates incident triage, reduces mean-time-to-mitigation, and empowers both the Viaduct team and tenant engineers with self-service debugging capabilities.
  • Shape the future of Viaduct Modern by contributing to the next-generation architecture, improving developer experience for hundreds of engineers, and establishing patterns that will be shared with the open-source community.

Requirements

  • 9+ years of software engineering experience, with significant depth in backend systems, distributed architectures, and platform engineering.
  • Deep expertise in observability and monitoring, including experience designing SLO frameworks, distributed tracing systems, and metrics pipelines at scale.
  • Proven track record in reliability engineering, with hands-on experience in incident response, root cause analysis, and building systems that maintain high availability (99.99%+).
  • Strong experience with performance tuning and resource management in JVM-based systems, including profiling, garbage collection optimization, and understanding of concurrency models (blocking I/O, thread pools, coroutines in Kotlin).
  • Experience operating critical, high-traffic systems with a focus on deployment safety, automated rollbacks, and progressive delivery strategies.
  • Familiarity with GraphQL or similar API gateway/data access layer technologies.
  • Experience building developer tooling and platforms, with a product mindset focused on developer experience and self-service capabilities.
  • Strong leadership and communication skills with the ability to partner effectively across infrastructure and product engineering teams.

Responsibilities

  • Embrace an AI-first engineering approach, using LLM-powered agents to generate and iterate on code while you focus on problem-solving, system design, and quality oversight.
  • Investigate and resolve complex production issues by analyzing distributed traces, resource utilization patterns, and system metrics to identify root causes and implement durable fixes.
  • Design and implement observability features including span instrumentation, SLO dashboards, and fine-grained attribution for blocking time, memory, and CPU across tenant workloads.
  • Develop and iterate on tooling for deployment triage, service health monitoring, and incident response automation using LLM capabilities.
  • Lead technical design discussions and RFCs for initiatives like performance regression testing pipelines, emergency deployment workflows, and runtime resiliency improvements.
  • Partner with tenant teams to debug performance issues, provide guidance on GraphQL best practices, and enable self-service capabilities for common operational tasks.
  • Contribute to open-source Viaduct by ensuring platform improvements are generalizable and well-documented for the broader engineering community.

Benefits

  • This role may also be eligible for bonus, equity, benefits, and Employee Travel Credits.

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Education Level: No Education Listed
  • Number of Employees: 1,001-5,000 employees
