Senior Software Engineer, Site Reliability

Benchmark Education Company

About The Position

We are seeking a Senior Software Engineer, Site Reliability to help improve the reliability, scalability, and performance of our cloud-based systems. In this role, you will contribute to feature development, support critical production environments, improve observability, and promote operational excellence. You will have opportunities to mentor junior engineers, lead smaller initiatives, and collaborate with engineering teams to instill reliability best practices across the organization.

Requirements

5+ years of experience in Site Reliability Engineering, DevOps, or Software Engineering with a focus on production operations.
Strong knowledge of AWS cloud services and cloud-native architectures.
Proficiency in scripting or programming languages (e.g., Python, Bash).
Experience with observability tools (e.g., CloudWatch, Datadog, Prometheus, Grafana).
Familiarity with infrastructure-as-code tools (e.g., Terraform, CloudFormation) and CI/CD pipelines.
Strong problem-solving skills and ability to work cross-functionally.
Some experience mentoring or coaching junior engineers.

Nice To Haves

AWS certifications (e.g., AWS Certified Solutions Architect – Associate or AWS Certified DevOps Engineer – Professional).
Experience leading on-call rotations, capacity planning, and chaos engineering initiatives.
Experience with containerization (Docker, ECS, Kubernetes/EKS).
Familiarity with incident response best practices and operational readiness processes.
Knowledge of PHP or Java.

Responsibilities

Contribute to the design, development, and delivery of features that enhance system reliability and scalability.
Define, measure, and improve SLIs, SLOs, and error budgets in collaboration with engineering teams.
Participate in building a culture of reliability through knowledge sharing, documentation, and process improvements.
Implement and improve observability tooling and practices to monitor the health and performance of production systems.
Participate in incident management, including on-call rotations, root cause analysis, and postmortem reviews.
Lead smaller initiatives or components of larger projects, ensuring technical quality and operational readiness.
Collaborate with software engineering, security, and product teams to ensure resilient and secure system design.
Mentor junior engineers, sharing expertise in SRE principles and AWS best practices.
Contribute to automation efforts to reduce toil and improve the efficiency of operational processes.
