Site Reliability Engineer

Basata Inc • Tempe, AZ

1d • Onsite

About The Position

We're looking for our first dedicated Site Reliability Engineer (SRE) to own reliability as we grow. This is a build role, not a maintenance one. Our infrastructure today is deliberate and well-structured: infrastructure-as-code, containerized services, clear conventions, and a defined deployment approach. We've built a solid foundation, and the next phase is designing the reliability practice, tooling, and architecture that will carry us from serving our current clinics to serving many times more. You'll define how we do SRE here, set the standards, and have real ownership over a domain that directly determines whether clinics can trust us with their work.

Requirements

Strong software engineering fundamentals—you write code to solve operational problems, not just configure systems. Our stack spans Java and Python on the backend with TypeScript on the frontend, and you'll work across it.
Real experience running production systems: containerized services, cloud infrastructure, and infrastructure-as-code.
Depth in observability and incident response—you've built monitoring and alerting that catches problems early, and you've led real incidents to resolution under pressure.
The ability to pick up an unfamiliar codebase, reason about how it behaves in production, and identify failure modes—because you'll need to understand our system to keep it reliable.
Experience designing for reliability at the architecture level: capacity planning, scaling strategies, failure isolation, and safe deployment practices.
Calm, structured judgment during incidents, and the discipline to turn each one into a lasting improvement.
Comfort owning an ambiguous, greenfield mandate—you're energized by defining a practice from the ground up rather than inheriting a finished one.

Nice To Haves

Experience as an early or first SRE hire, or building a reliability function from scratch.
Experience scaling systems through stages of rapid growth.
Healthcare, regulated-industry, or other high-stakes-reliability background—where downtime and data handling carry real consequences.

Responsibilities

Own the reliability, availability, and performance of our production platform—define our SLOs, build the observability to measure against them, and drive the work to meet them.
Establish our incident response practice end to end: triage, mitigation, resolution, and blameless postmortems that actually prevent recurrence.
Design and build the next generation of our infrastructure and deployment systems as we scale—evolving our infrastructure-as-code, deployment pipeline, and operational tooling.
Reduce operational toil through automation, so reliability scales faster than headcount.
Work closely with our engineers to make services more operable—better instrumentation, graceful degradation, and designs that hold up under failure. This means reading and contributing to application code, not just managing it from the outside.
Set the operational culture and engineering standards for reliability on a small, serious team—and grow the practice as the team grows.

Benefits

Drive real impact. Your work will help clinics run more smoothly and ensure patients get the care they need—faster and with fewer headaches.
Shape something meaningful. From early product decisions to UI details, you'll play a big role in crafting both the code and the overall user experience.
High ownership. Join a team where you’re trusted to lead, build, think critically, and bring ideas to life.
Work with purpose. We’re not here to throw more tech at the wall; we’re solving real problems in healthcare with tools that people rely on every day.
