Site Reliability Engineer

Underdog Fantasy
$160,000 - $240,000

About The Position

This is a rare opportunity to be a founding SRE at Underdog, helping define how reliability, scalability, and operational excellence work as the company continues to grow. You’ll operate in exploration mode early on, identifying the highest-leverage reliability challenges and shaping our approach to incident response, observability, and SLOs. This is a high-impact role with real ownership from day one, partnering closely with platform, infrastructure, and product teams to ensure Underdog scales through peak traffic, game-day spikes, and rapid iteration while improving both system reliability and developer experience.

Requirements

  • You are a strong written and verbal communicator
  • You are collaborative by nature
  • You enjoy using research, data, and experiments to make decisions; you believe “Hope is not a strategy.”
  • You enjoy working directly with customers (generally engineers or other people inside the company)
  • You think long-term about what is best for the business and its customers
  • You are excited to take ownership
  • You are very comfortable in an IDE and with working across multiple languages, multiple web application frameworks, AWS services, Kubernetes, and PostgreSQL
  • You can work independently to learn new languages/technologies as needed
  • You enjoy deploying changes to production quickly, multiple times a week if necessary

Nice To Haves

  • Experience with PostgreSQL query optimization, including tuning autovacuum settings, table statistics, and different index types
  • Experience with Redis/Valkey optimization
  • Experience with Datadog or similar observability tools
  • Experience working as a web application developer, frontend or backend, especially in React and Ruby on Rails
  • Experience with AWS cost optimization
  • Familiarity with the Google SRE books or similar books, or other forms of SRE training
  • Active use of AI tools to augment your abilities and learn about domains of interest

Responsibilities

  • Own and maintain the incident response process, including defining procedures, tools, and best practices
  • Guide teams in establishing and monitoring Service Level Objectives (SLOs), including setting up alerts and reporting systems
  • Lead capacity planning initiatives, focusing on both short- and long-term scalability while optimizing costs
  • Develop and implement disaster recovery plans, including regular testing and regulatory compliance
  • Collaborate with teams on architecture decisions to ensure high availability and scalability
  • Manage launch and event planning for high-traffic occasions, focusing on infrastructure preparation and capacity management (a.k.a. Launch Readiness)
  • Act as an internal expert and consultant for monitoring tools like Datadog and PagerDuty and infrastructure like AWS and Kubernetes
  • Emphasize automation and tooling to scale our workload
  • Contribute across codebases in Ruby, Python, Go, TypeScript, Swift, and Kotlin as needed to support the initiatives described above

Benefits

  • Unlimited PTO (we're extremely flexible, with the exception of the first few weeks before and into the NFL season)
  • 16 weeks of fully paid parental leave
  • Home office stipend
  • A connected, virtual-first culture with a highly engaged distributed workforce
  • 5% 401(k) match, FSA, and company-paid health, dental, and vision plan options for employees and dependents