Director, Site Reliability Engineering

EarnIn
Mountain View, CA
$315,000 - $385,000
Hybrid

About The Position

As one of the first pioneers of earned wage access, EarnIn is passionate about building products that deliver real-time financial flexibility for people living paycheck to paycheck. Our community members access their earnings as they earn them, with options to spend, save, and grow their money without mandatory fees, interest rates, or credit checks. We’re fortunate to have an incredibly experienced leadership team, world-class funding partners like A16Z, Matrix Partners, DST, and Ribbit Capital, and a very healthy core business with a tremendous runway. We’re growing fast and are excited to continue bringing world-class talent onboard to help shape the next chapter of our growth journey.

The Director of Site Reliability Engineering (SRE) will provide strategic leadership and technical direction for the reliability, scalability, and performance of our mission‑critical systems and services. This role combines deep SRE expertise with strong engineering leadership to drive organizational transformation toward reliability-first principles. The ideal candidate brings a strong software engineering foundation, a passion for automation, and a proven ability to develop and lead high‑performing teams. The Director will partner with engineering, product, operations, and business stakeholders to design, deliver, and operate resilient, high‑availability systems that support our customers and business objectives at scale.

The Mountain View base salary range for this full-time position is $315,000 to $385,000, plus equity and benefits. Our salary ranges are determined by role, level, and location. This is a hybrid position in Mountain View, requiring in-office work 2 days a week.

Requirements

  • BS, MS, or PhD degree in Computer Science, Engineering, or a related field, or related experience.
  • 7+ years of experience in the field, including 3+ years leading SRE teams or a team in a similar role.
  • Strong experience with container orchestration (Kubernetes), infrastructure as code (Terraform), and CI/CD pipelines.
  • Hands-on experience with observability platforms (e.g., Datadog, Prometheus, Grafana) and incident management tools (e.g., incident.io, PagerDuty).
  • Proficiency in at least one programming language (Python, Go, or Java) with the ability to review code and guide system design decisions.
  • Proven experience in architecting and managing highly available, scalable, and fault-tolerant systems.
  • Ability to define a clear reliability vision and inspire teams and stakeholders toward long‑term reliability goals.
  • Demonstrated sound judgment and calm decision‑making under pressure, particularly during high‑severity incidents.
  • Strong people leadership skills, with experience coaching and mentoring engineering talent, developing future leaders, and aligning peer engineering managers and leaders on reliability best practices.
  • Strategic planning skills with a track record of aligning technical direction with organizational objectives.
  • Excellent communication skills; able to translate complex technical issues into clear, actionable insights for executive and non‑technical audiences.
  • Highly collaborative, with the ability to work effectively across engineering, product, operations, and business functions and leaders.

Responsibilities

  • Drive organizational transformation toward SRE principles and own the strategic direction for reliability maturity, cultivating a culture centered on reliability, efficiency, and continuous improvement.
  • Develop and oversee automation strategies, tools, and frameworks that improve system reliability, reduce operational toil, and enhance team productivity.
  • Architect and evolve robust observability, monitoring, and alerting systems; champion chaos engineering and resilience testing practices to proactively validate system behavior under failure conditions.
  • Partner with engineering, product, and operations teams to embed SRE practices throughout the development lifecycle and influence architectural decisions for reliability.
  • Build, mentor, and develop a high‑performing global SRE organization, fostering technical excellence, career growth, and a strong culture of knowledge sharing.
  • Oversee capacity planning, scalability assessments, and future‑state demand forecasting across critical systems.
  • Lead and govern high‑severity incident response practices, ensuring rapid triage, thorough root cause analysis, and follow‑through on corrective and preventive actions.

Benefits

  • equity
  • benefits