Senior Site Reliability Engineer

Branch Metrics
Austin, TX
$127,000 - $165,000
Remote

About The Position

At Branch, we power every touchpoint with links that work and insights that prove it. From click to conversion, we make growth measurable. Our unparalleled attribution, backed by AI-enhanced linking, is trusted to deliver seamless experiences that increase ROI, decrease wasted spend, and eliminate siloed attribution.

We bring the same rigor to how we build our team by empowering our people to move fast, own outcomes, and build something that matters. We take pride in making meaningful investments in our team’s health, wealth, and growth so individuals can thrive as we scale. Our culture values smart, humble, and collaborative teammates who take accountability and drive results in an environment where their work truly moves the business forward. We are innovative, scaling with purpose, and led by seasoned leaders who know how to build enduring companies. Trusted by brands like Instacart, Western Union, NBCUniversal, ZocDoc, and Sephora, we’re big enough to matter, small enough for you to make a real impact. If you’re excited by the grit of building, rapid learning, and shaping the future of customer growth, you’ll find your place here.

We are seeking a highly experienced Senior Site Reliability Engineer to own the reliability, performance, and operational excellence of our large-scale, distributed infrastructure. You will lead the design and execution of systems that power mission-critical services, shaping engineering practices, influencing architectural decisions, and driving automation and resiliency across the organization.

Requirements

  • 6+ years in SRE, systems engineering, or software engineering roles, ideally within fast-paced, rapidly scaling environments.
  • Proven track record as a senior reliability or production engineer, with ownership of large, distributed, customer-facing systems.
  • Expert-level proficiency in Kubernetes, AWS, Linux internals, and distributed system fundamentals.
  • Strong programming skills in Go, Python, Java, Kotlin, Bash, or similar languages, with an emphasis on building reliable automation and tooling.
  • Hands-on experience with modern observability stacks (Prometheus, Grafana, Alertmanager, Loki, PagerDuty).
  • Familiarity with large-scale data and streaming ecosystems such as Kafka, Spark, Aerospike, FoundationDB, and the broader Hadoop ecosystem.
  • Deep experience with Terraform, CloudFormation, or related IaC tooling, and the ability to guide teams in IaC best practices.
  • Proven incident management leadership in production SaaS systems, including on-call excellence, postmortem execution, and long-term reliability improvements.
  • Exceptional problem-solving skills and the ability to lead complex investigations across multiple system layers.
  • Strong communication, cross-functional leadership, and ability to influence engineering best practices.
  • Hands-on expertise with Argo CD, GitOps workflows, and CI/CD architectures.

Responsibilities

  • Architect, design, and evolve complex distributed systems to improve reliability, operational efficiency, and performance at scale.
  • Partner closely with product, security, and data engineering teams to translate business needs into resilient and scalable system designs.
  • Drive reliability through automation and advanced observability, ensuring proactive detection, reduced mean time to recovery, and consistent system hygiene.
  • Lead and mentor in high-stakes situations, owning debugging efforts for critical issues and establishing durable prevention strategies.
  • Perform deep infrastructure cost audits, identifying areas of inefficiency and implementing solutions that reduce waste without compromising performance or security.
  • Own and maintain key distributed data platforms, including Aerospike and FoundationDB, ensuring durability, consistency, and performance.
  • Guide teams in defining SLIs/SLOs and operational best practices, elevating system reliability and engineering rigor across the org.
  • Continuously identify and eliminate bottlenecks, improving system throughput, latency, and overall efficiency.
  • Champion Infrastructure as Code (IaC) to automate provisioning, configuration, and lifecycle management using modern IaC tools and principles.
  • Lead our GitOps and deployment strategy using Argo CD to implement secure, repeatable, and scalable delivery workflows across Kubernetes environments.