Principal SRE

Chalkboard • New York, NY

About The Position

Chalkboard is building the future of sports gaming. Our mission is to blur the line between watching and playing by turning real-money sports gaming into a social, immersive experience built for fans who play to win. We're not just creating another betting app. We're reimagining how sports fans engage with the games they love. At our core, we're a team of sports-obsessed builders who value clarity, fairness, and the thrill of helping fans turn insight into action.

We're looking for a Principal Site Reliability Engineer to join Chalkboard and help us build a platform that is reliable, scalable, and easy for teams to build on. In this role you'll work closely with Engineering, Product, and Data teams, playing a meaningful part in how millions of fans experience sports in real time. If you're someone who loves building from scratch, thrives in fast-moving environments, and wants to win as a team, not just as an MVP, keep reading!

Requirements

Cloud Infrastructure (GCP preferred): networking, IAM, databases, storage
Kubernetes: cluster operations and workload management
Infrastructure as Code: Terraform, Helm
CI/CD: GitHub Actions or similar
Observability: metrics, logging, tracing, alerting
8+ years of experience in SRE, platform engineering, or infrastructure roles
Strong experience with distributed systems and backend architectures
Proven ability to improve system reliability, scalability, and performance
Experience building and improving CI/CD pipelines and deployment workflows
Strong debugging skills using data (logs, metrics, traces)
Experience leading incident response and driving root cause analysis
Ability to make pragmatic tradeoffs between speed, reliability, and scale
Experience partnering across engineering teams to improve developer velocity

Nice To Haves

Experience with Go or backend frameworks like Nest.js
Experience with Datadog or similar observability platforms
Familiarity with Postgres, MongoDB, Firestore, or Redis
Experience with messaging systems like RabbitMQ
Experience with GitOps tools (FluxCD, Kustomize)
Passion for sports, gaming, or betting products

Responsibilities

Own platform reliability end-to-end, proactively identifying and mitigating risks before they impact users
Build and evolve observability (metrics, logs, tracing) to enable fast detection, diagnosis, and resolution of issues
Scale infrastructure ahead of demand by identifying bottlenecks and implementing durable architecture improvements
Reduce developer friction by improving CI/CD pipelines, deployment workflows, and internal tooling
Lead incident response and root cause analysis, driving systemic fixes—not just short-term patches
Establish and enforce best practices for infrastructure, deployments, and system reliability
Build reusable, self-service infrastructure that enables teams to ship quickly and safely
Continuously improve systems through automation and Infrastructure-as-Code

Benefits

Comprehensive medical, dental, and vision coverage starting within 30 days, with the majority of premiums covered by Chalkboard
401(k) with company match
Lunch on us every day with a corporate DoorDash account
Refuel in the office with protein shakes, energy drinks, and a snack buffet
Flexible time-off policy, plus 10 company holidays and WFH during the holidays
