About The Position

The Trade Desk is a global technology company with a mission to create a better, more open internet for everyone through principled, intelligent advertising. Handling over 1 trillion queries per day, our platform operates at an unprecedented scale. We have also built something even stronger and more valuable: an award-winning culture based on trust, ownership, empathy, and collaboration. We value the unique experiences and perspectives that each person brings to The Trade Desk, and we are committed to fostering inclusive spaces where everyone can bring their authentic selves to work every day.

Do you have a passion for solving hard problems at scale? Are you eager to join a dynamic, globally connected team where your contributions will make a meaningful difference in building a better media ecosystem? Come and see why Fortune magazine consistently ranks The Trade Desk among the best small to medium-sized workplaces globally.

About The Team

The Service Excellence (SE) team owns the tools and infrastructure that help engineers at The Trade Desk understand and operate production systems. The Incident Response Services (IRS) taskforce focuses on the on-call experience. The team is responsible for making incidents easier to detect, manage, and optimize using historical data.

What You Will Work On

  • Incident management tooling: Build and maintain automation around the incident lifecycle: alerting, escalation, incident channels, retros, and SLA tracking
  • Help evaluate and migrate our logging stack: Participate in the re-evaluation of our logging vendor and collection architecture
  • Backstage/Service catalog: Extend our internal developer portal with K8s integrations, maturity models, and SLO adoption tooling
  • Alert quality tooling: Build the systems that give engineers better signal and less noise, including smarter routing, better grouping, and tighter feedback loops between alerts and the teams that own them

Requirements

  • Experience building and operating production infrastructure or internal developer tooling
  • Comfort working across the stack — this role touches distributed systems, Kubernetes, observability pipelines, and web-based tooling
  • Familiarity with observability concepts: logging, alerting, on-call workflows
  • Strong debugging instincts: You will be called on when things break
  • Clear communication: The team works closely with engineers across the company; you'll need to explain tradeoffs and advocate for solutions

Nice To Haves

  • Experience with Grafana, Prometheus, or similar observability tools
  • Familiarity with Sumo Logic or other log management platforms
  • Prior work on developer portals or service catalog tooling (Backstage, OpsLevel, etc.)
  • Experience with Kubernetes at scale

Responsibilities

  • Build and maintain automation around the incident lifecycle: alerting, escalation, incident channels, retros, and SLA tracking
  • Help evaluate and migrate our logging stack
  • Participate in the re-evaluation of our logging vendor and collection architecture
  • Extend our internal developer portal with K8s integrations, maturity models, and SLO adoption tooling
  • Build the systems that give engineers better signal and less noise — smarter routing, better grouping, tighter feedback loops between alerts and the teams that own them

Benefits

  • comprehensive healthcare (medical, dental, and vision) with premiums paid in full for employees and dependents
  • retirement benefits such as a 401(k) plan and company match
  • short- and long-term disability coverage
  • basic life insurance
  • well-being benefits
  • reimbursement for certain tuition expenses
  • parental leave
  • sick time of 1 hour per 30 hours worked
  • vacation time for full-time employees of up to 120 hours through the first year and 160 hours thereafter
  • around 13 paid holidays per year
  • Employees can also purchase The Trade Desk stock at a discount through The Trade Desk’s Employee Stock Purchase Plan