Site Reliability Engineer

Base Power CompanyAustin, TX
3dOnsite

About The Position

You will define and own how Base runs production systems. As the first Site Reliability Engineer at Base, you will design the technical and operational model for running our services across cloud infrastructure and hardware-connected systems with real-world constraints. This role sits at the boundary between software, infrastructure, and physical systems, where failures are observable, impactful, and sometimes irreversible. This is not a DevOps role.

Requirements

  • Strong coding ability in modern languages; we use Go, but value engineers who can learn and adapt quickly.
  • Experience deploying and operating infrastructure defined through infrastructure as code (e.g., Terraform, Pulumi).
  • Experience designing shared infrastructure and standards that enable teams to operate autonomously, without centralizing ownership or creating bottlenecks.
  • Experience owning and operating production systems with meaningful uptime and reliability requirements.
  • A deep understanding of reliability fundamentals, including failure modes, redundancy, load, and capacity.
  • Hands-on experience building highly observable systems, with clear opinions on metrics, logs, traces, and alerting.
  • Experience diagnosing failures in live systems by grounding decisions in production data rather than intuition.
  • Sound judgment when balancing reliability, development velocity, and system complexity.
  • Comfort taking end-to-end responsibility for systems in a fast-growing, ambiguous environment.

Responsibilities

  • Build and operate services and infrastructure that directly improve the reliability of Base’s production systems.
  • Design and own Base’s observability and operational tooling, including metrics, logs, traces, alerting, and the internal workflows engineers use to understand and operate live systems.
  • Define alerting and on-call standards so pages are actionable, correctly scoped, and tied to real user or system impact.
  • Establish and evolve how Base responds to reliability incidents, including response coordination, root cause analysis practices, and driving permanent fixes at the code, infrastructure, and architecture levels.
  • Own access control and auditability for production systems, including role-based access, permission boundaries, and security logging.
© 2024 Teal Labs, Inc
Privacy PolicyTerms of Service