Senior Site Reliability Engineer

Penn Mutual
Philadelphia, PA
$128,000 - $165,000

About The Position

Penn Mutual is seeking a Senior Site Reliability Engineer (Senior SRE) to help evolve reliability practices across business-critical systems in a highly regulated financial services environment. This is a hands-on technical leadership role responsible for designing, implementing, and advancing reliability across complex, distributed systems. You will deliver reliability strategy through automation, observability, and architectural guidance while staying directly involved in development, release, and incident triage. Senior SREs operate with significant autonomy, tackle ambiguous problems, and influence system design upstream, with the overall goal of delivering change to key stakeholders in reliable and measurable ways.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or related field.
  • 6–10+ years of experience in SRE, software engineering, platform, or DevOps roles.
  • Professional experience performing root cause analysis on incidents and documenting SRE systems and their usage.
  • Strong programming skills with professional experience in multiple languages.
  • Deep experience with AWS and distributed systems.
  • Advanced knowledge of observability, ITSM, and reliability engineering principles.
  • Proven ability to operate effectively in complex, regulated environments.
  • Experience using and implementing observability tools (metrics, logs, tracing).
  • Experience with CI/CD pipelines and deployment automation.
  • Experience investigating and documenting root cause analyses.
  • Familiarity with containerization and orchestration technologies.
  • Strong troubleshooting and analytical skills.

Nice To Haves

  • Experience with IaC (CloudFormation).
  • Experience with application frameworks (Spring, Spring Boot, React, Angular).
  • Experience with application servers/containers (Tomcat, Netty, Node.js, Next.js).
  • Experience with relational and non-relational databases and related ORM/drivers.
  • Experience working in Agile/Scrum environments.
  • Experience with ITSM tools (ServiceNow or similar).
  • Experience with ITIL-aligned change and release processes.
  • Familiarity with security compliance frameworks (ISO 27001, SOC 2).

Responsibilities

  • Lead reliability, availability, scalability, and recovery design for critical systems.
  • Define and evolve SLOs, SLIs, and error budget practices across services.
  • Identify systemic reliability risks and drive cross-team remediation efforts.
  • Influence application and platform architecture to improve operational outcomes.
  • Act as a technical lead during major incidents and complex outages.
  • Drive high-quality root cause analysis and recommend corrective actions.
  • Improve incident response processes, tooling, and runbooks.
  • Design and implement advanced automation to eliminate operational toil at scale.
  • Build and maintain shared SRE tooling and platforms.
  • Set engineering standards for reliability-focused code and operational practices.
  • Review and improve CI/CD, deployment, and rollback strategies.
  • Partner with Release and Change Management to automate release practices.
  • Lead risk assessments for high-impact changes and releases.
  • Ensure compliance requirements are met without sacrificing engineering velocity.
  • Serve as a reliability authority for release readiness decisions.
  • Mentor junior SREs and junior engineers through technical guidance and review.
  • Lead by example in operational excellence and engineering rigor.
  • Influence reliability culture across engineering and product teams.