About The Position

In this role, you will be a key pillar of our engineering organization, ensuring that our services remain highly available and performant. Your impact will include:

  • System Architecture: Designing and implementing the next generation of our telemetry and alerting systems.
  • Reliability Engineering: Defining SLOs/SLIs and ensuring our monitoring strategy captures the true health of the user experience.
  • Operational Excellence: Reducing operational load through software; if you have to do it twice, you’ll want to automate it.
  • Collaboration: Partnering with App Dev teams to influence the "design for reliability" phase of the software development lifecycle.
  • Mentorship: Acting as a technical lead for junior members and off-shore partners, providing guidance on runbook development and disaster recovery.

Requirements

  • 5+ years in SRE, DevOps, or Infrastructure roles with a proven track record of managing high-traffic, internet-facing production environments.
  • Deep experience building and operating container orchestration systems (EKS, GKE, or vanilla Kubernetes). You should be comfortable troubleshooting from the networking layer up to the application pod.
  • Expert knowledge of the four Golden Signals (latency, traffic, errors, and saturation). Proficiency with tools such as Prometheus, Grafana, and Splunk is essential.
  • Hands-on experience designing and maintaining resilient infrastructure on public cloud providers (AWS, GCP, or Azure).
  • Strong ability to code at a scripting level (Python or Go preferred) to automate toil and build self-healing systems.
  • Experience leading incident response, performing Root Cause Analysis (RCA), and running blameless post-mortems to improve system resilience.
  • Proficiency with Terraform, CloudFormation, or Pulumi for managing immutable infrastructure.

Nice To Haves

  • Specialized experience operating and tuning Solr or Elasticsearch at scale.
  • Strong understanding of TCP/IP, Load Balancing (ELB/ALB), and Service Mesh (Istio/Linkerd).
  • Experience with Kafka, Cassandra, or Postgres in a distributed environment.

© 2024 Teal Labs, Inc