Kharon • posted 9 days ago
$220,000 - $265,000/Yr
Full-time • Mid Level
Onsite • Denver, CO
51-100 employees

Reporting to the VP of Engineering, as a Staff Site Reliability Engineer, you’ll be the force behind building systems that are fast, resilient, and scalable. At Kharon, reliability isn’t just about uptime; it’s about ensuring our clients can trust critical insights when they need them most. You’ll champion best practices in observability, automation, and incident response, while collaborating across engineering, DevOps, and security to create paved roads that empower teams to deliver with confidence. From defining SLOs and streamlining on-call to optimizing performance and driving recovery readiness, your work will directly strengthen the foundation that enables Kharon to deliver mission-critical intelligence at scale.

To the right person, this will be the perfect kind of challenge. Our mission is compelling, our product is powerful, and we’re growing at a rate that makes us unstoppable. If you’re looking to be surrounded by people who will inspire you to think and challenge you to grow, then look no further. Our team is made up of some of the most visionary and uncompromising individuals you will ever encounter. We don’t take ourselves seriously, but we’re serious about the work we do, and there is absolutely no slowing us down. To keep that momentum going, we do our very best to make sure that each and every team member is completely taken care of.

  • Stand up and standardize metrics, logging, tracing, and alert hygiene; introduce golden dashboards and alert runbooks.
  • Coach engineers on reliability practices, including leading incident response (MTTA/MTTR), running blameless postmortems, and conducting reliability reviews.
  • Plan capacity, conduct load/perf tests, and drive performance tuning and cost–reliability tradeoffs.
  • Collaborate with DevOps on Kubernetes/cloud/IaC standards, including creating paved roads and production-readiness checklists for app teams.
  • Work cross-functionally on resilient CI/CD (pre-deployment checks, canary/blue-green, automated rollbacks).
  • Align with security on least privilege, secrets management, and audit-ready operational practices.
  • Define RTO/RPO, backups, and failover drills; document and test recovery playbooks.
  • Identify repetitive work and opportunities to automate it (scripts, jobs, runbooks, self-service tooling).
  • Help shape on-call rotations, escalation policies, and handbooks, ultimately improving signal-to-noise and engineer well-being.
  • Assist in defining SLIs/SLOs and error budgets with product/engineering, creating visibility into availability, latency, and quality.
  • Bachelor's Degree in Computer Science, Engineering, or a related field.
  • 10-12+ years of experience in software engineering or DevOps, with at least 5 years in a site reliability engineering (SRE) or reliability-focused role.
  • Strong networking fundamentals, including DNS, Kubernetes routing, load balancing, WAF, multi-VPC routing in AWS, and Traefik.
  • Solid software fundamentals (one or more of: Python, Java, Go, Scala or similar) and ability to read/modify production services.
  • Deep experience in a major cloud (AWS/GCP/Azure) and container orchestration (Kubernetes).
  • Proficiency with IaC (Terraform or equivalent), CI/CD systems, and git-based workflows.
  • Hands-on with metrics/logging/tracing systems and alerting best practices.
  • Proven incident commander experience and skillful facilitation of blameless postmortems.
  • Solid grasp of networking, HTTP, load balancing, caching, and data stores (SQL/NoSQL/queues).
  • Excellent communication, documentation, and cross-team influence.
  • Fully sponsored medical, dental, and vision
  • FSA program for both medical and dependent care
  • 401k + Roth with matching and immediate vesting
  • Paid time off + 11 paid holidays