Reliability Platform is a key pillar of DoorDash’s Production Lifecycle team, alongside Observability and Deploy Platform. This group’s mandate is to enable users and agents to reason about the health of our services, facilitate safe change control, and provide the means to rapidly address any unexpected state.

Ownership is fundamental in DoorDash culture, and all teams own what they build. We are not here to operate services on others’ behalf, but to provide tools that enable their success and ensure a consistently high level of quality in everything we do. We approach challenges with the pragmatic perspective of an SRE, and deliver solutions with the mindset of a SWE who detests toil and repetitive tasks. We use software and agents to “keep the lights on” and focus our energy on innovation that will level up the entire organization.

This mission falls into three main categories:

Service Health: Provide SLO frameworks, analytics tools, and AI Agent enablement to extract high-quality insights from our telemetry, pinpoint faults, and highlight deficiencies.

Change Orchestration: Provide self-service provisioning orchestration, evolving from UI-driven to Agent-driven, to allow our developers to safely affect production from their IDE.

Incident Management: Define and deliver the tools, processes, and policies our peers use to quickly understand and recover from any unexpected issues in the environment.

This mandate implies a broad contribution across many aspects of the infrastructure, and demands equal parts software development and systems integration. Our priorities are always informed by an obsession with leveling up over 4,000 internal customers and peers, and with abstracting away infrastructure complexity so they can focus on making the DoorDash product itself amazing!

As a Software Engineer on the Reliability Platform team, you’ll help design, build, and operate services and infrastructure that deliver on the team’s broad mandate described above.
This team has a unique opportunity for breadth, often in collaboration with expert peers across the Infrastructure and Product teams. Depending on need and interest, you may work on mission-critical back-end services or pipelines, complex orchestration workflows, self-service UI, or continuous improvement of AI Agents. We have fully embraced the use of AI tools in everything we do, and believe in the incredible potential they provide, while remaining pragmatic enough to ensure the critical infrastructure we maintain cannot be compromised. Our goal is to deliver innovative next-generation capabilities, and to make the data in our custody available to others pursuing the same.

A few examples of efforts the team has owned in recent years:

Delivering a framework to capture, alert on, and report on SLO quality across tens of thousands of endpoints, ensuring all teams are accountable for the quality of their delivered services.

Replacing our escalation management tools, including alignment with our internal Asset/Team Catalog to allow automated alert routing and cross-brand alignment.

Delivering an MCP back-end for Reliability Platform data and tools, and enabling the same for peer teams across the Core Infrastructure organization.

Designing and delivering orchestration tools to enable self-service provisioning of critical infrastructure (Kafka topics, databases, CPU/GPU pools, service scaffolding, etc.).

Building a PoC for internal SRE AI agentic tooling that leverages internal MCPs and domain-specific profiles to facilitate troubleshooting and Q&A capabilities, replacing FAQs and runbooks.

Delivering per-pod real-time key-value configuration tooling, enabling runtime feature flag management from a central source of truth across the fleet (100K+ pods).

We are proud of our engineering culture, and many of our greatest successes are born from an individual with an idea spending some time hacking out a rudimentary, demonstrable prototype. The mandate of this team is well suited to individuals with this creative, pioneering mindset and the ability to execute.
Job Type
Full-time
Career Level
Senior