Sr. Site Reliability Engineer
Platform Science · Posted: August 1, 2023 · Hybrid
About the position
We are seeking a qualified Senior SRE to join our team at Platform Science. In this role, you will be responsible for solving operational problems and providing support to development teams for critical business applications in production. Our focus is on ensuring reliability in all production services and enabling dev teams to measure their reliability for effective decision-making. As a Senior SRE, you should have a software development or systems background with strong coding skills. If you are excited about working with new technologies, supporting various products, and collaborating with a talented team, then this position is for you.
Responsibilities
- Collaborate with teams to architect, engineer, and optimize products for Kubernetes and the cloud
- Create and enhance Continuous Integration/Continuous Deployment (CI/CD) pipelines, release management processes, and tools
- Maintain observability tools and promote standardization and best practices for development teams
- Build tools, automation, and frameworks to improve system stability and reliability
- Lead the effort in promoting and prioritizing reliability, driving achievement of uptime goals and mentoring colleagues in SRE best practices
- Provide on-call support to development teams for critical business applications in production
- Play an active role in facilitating an SRE guild, contributing to its operation and ensuring the sharing of knowledge and collaboration among members
- Conduct comprehensive Production Readiness Reviews, working with teams to identify and establish Service Level Objectives (SLOs), and ensure high-quality and dependable services
- Write and contribute to project plans and engineering-level documentation, and develop operationally excellent standard operating procedures and runbooks with a focus on automation
Requirements
- 5+ years of experience in an SRE or Platform Engineer role supporting a 24x7 production environment
- 3+ years of experience administering or supporting AWS or comparable cloud resources in a production environment
- Strong expertise in Kubernetes administration, containerization tools (e.g., Docker), and Helm, adhering to industry best practices such as GitOps
- Proficiency in scripting or programming languages such as Python, Ruby, Bash, Node.js, and/or Go
- Experience with distributed tracing and proficiency with one or more of the following monitoring solutions: Prometheus, Elasticsearch, Datadog, and CloudWatch
- Demonstrated proficiency with current software development lifecycle (SDLC) concepts and best practices, CI/CD pipelines, and test-driven development
- Strong problem-solving and operational skills
- Automation advocate
Benefits
- Medical, dental, and vision insurance
- Short-term and long-term disability insurance
- AD&D and life insurance
- 401(k) plan
- Paid vacation, sick leave, and holidays
- Six weeks of paid parental leave