Senior Site Reliability Engineer (SRE)

Oowlish Technology

Remote

About The Position

We are looking for an experienced Senior Site Reliability Engineer (SRE) to own the reliability, availability, and operational excellence of business-critical production systems. This is a dedicated Site Reliability Engineering role—not a general DevOps or Infrastructure position. You will define how reliability is measured, lead incident response during production outages, drive observability strategy, and continuously improve operational practices across high-availability environments. The ideal candidate has hands-on experience managing SLOs, leading major incidents, improving on-call operations, and building a strong reliability culture through automation, observability, and continuous improvement.

Requirements

5+ years of experience in Site Reliability Engineering, Production Engineering, Reliability Engineering, or similar roles.
Proven experience operating production systems in high-availability environments.
Hands-on experience defining and managing SLOs, SLIs, and Error Budgets.
Experience leading production incident response and Incident Command.
Strong observability and monitoring experience.
Strong software engineering skills using Python, Go, or TypeScript.
Experience working with cloud platforms.
Strong written and verbal English communication skills.
Proven Site Reliability Engineering experience.
Experience defining and managing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.
Experience leading Incident Command during major production incidents.
Experience conducting blameless postmortems and driving follow-up actions.
Experience designing, maintaining, and improving on-call programs.
Experience developing runbooks and escalation policies.
Strong observability experience, including monitoring, logging, alerting, and distributed tracing.
Experience tuning alerts to reduce operational noise.
Strong automation skills using Python, Go, or TypeScript.
Experience supporting mission-critical production systems.
Experience working in high-availability production environments.

Nice To Haves

Experience with Datadog.
Experience with AWS.
Experience with Heroku.
Experience working in regulated industries (Healthcare, HIPAA, Financial Services, etc.).
Experience establishing or maturing an SRE practice.
Experience with capacity planning.
Experience with disaster recovery planning and execution.
Experience with Kubernetes.
Experience with PostgreSQL or SQL Server.
Experience supporting modern TypeScript-based applications.

Responsibilities

Define, implement, and continuously improve Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.
Develop and maintain observability strategies, including monitoring, logging, tracing, and alerting.
Own observability configuration, instrumentation, and alert optimization.
Lead Incident Command during production incidents and coordinate cross-functional response efforts.
Drive blameless postmortems and ensure corrective actions are completed.
Own and continuously improve the on-call program, including rotations, escalation policies, runbooks, and alert tuning.
Establish production readiness standards for new services.
Partner with engineering teams on capacity planning, scalability, and disaster recovery initiatives.
Automate operational processes and reliability improvements using software engineering best practices.
Continuously improve system reliability, availability, and operational efficiency.