Senior Site Reliability Engineer

CertifID • Austin, TX

About The Position

We are seeking a Senior Site Reliability Engineer (Senior SRE) to drive reliability improvements across our production SaaS environment. You'll play a critical role in building scalable infrastructure patterns, advancing observability, improving incident response, and partnering with engineering teams to embed reliability into system design and delivery. This role is ideal for an experienced Sr. SRE who enjoys solving complex operational problems, building automation, and mentoring others.

Requirements

Experience: 5+ years in SRE, DevOps, Platform Engineering, or Infrastructure Engineering.
Cloud Expertise: Proven experience supporting production SaaS systems in Azure (preferred), AWS, or GCP.
Technical Stack: Strong Linux, networking, and distributed systems troubleshooting skills.
Containers: Strong experience with containers and orchestration (Kubernetes/EKS/AKS).
IaC & Tooling: Expertise with Infrastructure-as-Code (Terraform strongly preferred).
Programming: Strong scripting/programming skills in Python, Go, Bash, or C#/.NET.
Observability: Hands-on experience with Datadog, Prometheus/Grafana, or OpenTelemetry.

Responsibilities

Reliability & Platform Operations: Own and improve the reliability, availability, and performance of production systems while defining and operationalizing SLIs/SLOs and error budgets.
AI Agent Enablement: Design and implement autonomous and semi-autonomous AI agents for monitoring distributed systems and applications. Build agents capable of consuming multi-source observability data (metrics, logs, traces, etc.).
Incident Response: Participate in and help lead an on-call rotation, serving as an escalation point for major incidents and facilitating blameless postmortems.
Automation & Infrastructure: Build automated workflows to eliminate manual work, and design and maintain Infrastructure-as-Code with Terraform.
Observability: Improve metrics, logs, traces, and alerting using tools like Datadog or Prometheus to reduce noise and increase signal.
Collaboration & Mentorship: Partner with application teams to implement reliability best practices and mentor junior engineers to foster a culture of knowledge sharing.