Senior Site Reliability Engineer

PowerPlan • Atlanta, GA

Hybrid

About The Position

This is a principal-level individual contributor role at the heart of our cloud platform’s reliability, scalability, and operational maturity. You will work hands-on across AWS and Azure environments, solving complex production problems while systematically eliminating the manual toil that creates them. The role offers significant autonomy, deep technical impact, and the opportunity to shape how reliability engineering is practiced across the organization.

Company

PowerPlan operates a growing SaaS platform supporting enterprise customers with mission-critical workloads. We run complex, multi-cloud environments and value engineers who take ownership, think in systems, and build solutions that scale. Our culture emphasizes operational excellence, blameless learning, and collaboration across Engineering, Support, Professional Services, and Product teams.

Requirements

Deep hands-on experience operating production systems in AWS and Azure environments
Strong automation skills using Python and PowerShell in operational contexts
Proven ability to identify repetitive operational work and eliminate it through automation
Experience leading incident response and blameless post-incident reviews
Strong observability expertise, particularly with Grafana and SLI/SLO-driven monitoring
Ability to influence engineering practices without formal authority
Clear written and verbal communication skills across technical and non-technical audiences

Responsibilities

Resolve escalated infrastructure cases across major AWS and Azure services and deliver 2–3 targeted automations that measurably reduce manual resolution time for recurring issues within 90 days.
Eliminate or significantly reduce manual intervention for the top 5–7 highest-frequency operational issues through automation, self-service tooling, or infrastructure improvements within 3–6 months.
Establish a consistent, high-quality incident response and post-incident review process resulting in faster containment, clearer ownership, and tracked corrective actions for all critical production incidents by month 9.
Deliver a mature observability layer across AWS and Azure with service-level dashboards, tuned alerts, and clear SLI/SLO reporting actively used by on-call and engineering teams by month 12.
