Senior Software Engineer - Observability & IRM

The Trade Desk • Boulder, CO

About The Position

The Trade Desk is a global technology company with a mission to create a better, more open internet for everyone through principled, intelligent advertising. Handling over 1 trillion queries per day, our platform operates at an unprecedented scale. We have also built something even stronger and more valuable: an award-winning culture based on trust, ownership, empathy, and collaboration. We value the unique experiences and perspectives that each person brings to The Trade Desk, and we are committed to fostering inclusive spaces where everyone can bring their authentic selves to work every day.

Do you have a passion for solving hard problems at scale? Are you eager to join a dynamic, globally connected team where your contributions will make a meaningful difference in building a better media ecosystem? Come and see why Fortune magazine consistently ranks The Trade Desk among the best small- to medium-sized workplaces globally.

About the Team:

The Service Excellence (SE) team owns the tools and infrastructure that help engineers at The Trade Desk understand and operate production systems. The Incident Response Services (IRS) taskforce focuses on the on-call experience. The team is responsible for making incidents easier to detect and manage, and for using historical incident data to improve how they are handled.

What you will work on:

Incident management tooling: Build and maintain automation around the incident lifecycle (alerting, escalation, incident channels, retros, and SLA tracking)
Help evaluate and migrate our logging stack: Participate in the re-evaluation of our logging vendor and collection architecture
Backstage/Service catalog: Extend our internal developer portal with K8s integrations, maturity models, and SLO adoption tooling
Alert quality tooling: Build the systems that give engineers better signal and less noise, with smarter routing, better grouping, and tighter feedback loops between alerts and the teams that own them

Requirements

Experience building and operating production infrastructure or internal developer tooling
Comfort working across the stack — this role touches distributed systems, Kubernetes, observability pipelines, and web-based tooling
Familiarity with observability concepts: logging, alerting, on-call workflows
Strong debugging instincts: You will be called on when things break
Clear communication: The team works closely with engineers across the company; you'll need to explain tradeoffs and advocate for solutions

Nice To Haves

Experience with Grafana, Prometheus, or similar observability tools
Familiarity with Sumo Logic or other log management platforms
Prior work on developer portals or service catalog tooling (Backstage, OpsLevel, etc.)
Experience with Kubernetes at scale

Responsibilities

Build and maintain automation around the incident lifecycle: alerting, escalation, incident channels, retros, and SLA tracking
Help evaluate and migrate our logging stack
Participate in the re-evaluation of our logging vendor and collection architecture
Extend our internal developer portal with K8s integrations, maturity models, and SLO adoption tooling
Build the systems that give engineers better signal and less noise — smarter routing, better grouping, tighter feedback loops between alerts and the teams that own them

Benefits

comprehensive healthcare (medical, dental, and vision) with premiums paid in full for employees and dependents
retirement benefits such as a 401k plan and company match
short- and long-term disability coverage
basic life insurance
well-being benefits
reimbursement for certain tuition expenses
parental leave
sick time of 1 hour per 30 hours worked
vacation time for full-time employees of up to 120 hours through the first year and 160 hours thereafter
around 13 paid holidays per year
Employees can also purchase The Trade Desk stock at a discount through The Trade Desk’s Employee Stock Purchase Plan