Sr. Site Reliability Engineer

Platform Science

Posted:

August 1, 2023

Job Commitment

Full-time

Experience Level

Senior

Workplace Type

Hybrid

Job Function

Dev & Engineering

This job is closed

About the position

We are seeking a qualified Senior SRE to join our team at Platform Science. In this role, you will be responsible for solving operational problems and providing support to development teams for critical business applications in production. Our focus is on ensuring reliability in all production services and enabling dev teams to measure their reliability for effective decision-making. As a Senior SRE, you should have a software development or systems background with strong coding skills. If you are excited about working with new technologies, supporting various products, and collaborating with a talented team, then this position is for you.

Responsibilities

Collaborate with teams to architect, engineer, and optimize products for Kubernetes and the cloud
Create and enhance Continuous Integration/Continuous Deployment (CI/CD) pipelines, release management processes, and tools
Maintain observability tools and promote standardization and best practices for development teams
Build tools, automation, and frameworks to improve system stability and reliability
Lead the effort in promoting and prioritizing reliability, driving achievement of uptime goals and mentoring colleagues in SRE best practices
Provide on-call support to development teams for critical business applications in production
Play an active role in running an SRE guild, contributing to its operation and fostering knowledge sharing and collaboration among members
Conduct comprehensive Production Readiness Reviews, working with teams to identify and establish Service Level Objectives (SLOs) and to ensure high-quality, dependable services
Write and contribute to project plans and engineering-level documentation, and develop standard operating procedures and runbooks that support operational excellence, with a focus on automation

Requirements

5+ years of experience in an SRE or Platform Engineer role supporting a 24x7 production environment
3+ years of experience administering or supporting AWS or comparable cloud resources in a production environment
Strong expertise in Kubernetes administration, containerization tools (e.g., Docker), and Helm, adhering to industry best practices such as GitOps
Proficiency in scripting or programming languages such as Python, Ruby, Bash, Node.js, and/or Go
Experience with distributed tracing and proficiency with one or more of the following monitoring solutions: Prometheus, Elasticsearch, Datadog, or CloudWatch
Demonstrated proficiency with current software development lifecycle (SDLC) concepts and best practices, CI/CD pipelines, and test-driven development
Strong problem-solving and operational skills
Automation advocate