Senior Site Reliability Engineer
Snyk · Posted: August 29, 2023 · Remote
About the position
You will join a team responsible for the reliability of key customer workflows and the applications that influence Snyk's reliability. Your role will involve pair-programming to improve Snyk's services, establishing SLIs and SLOs for customer workflows, diagnosing factors that threaten SLOs, improving observability and measurement, and creating error budgets. Additionally, you will work on reducing deployment lead times, improving application design, sharing practices and tooling with other teams, implementing capacity management and load testing capabilities, and ensuring monitoring and alerting are focused on customer impact. Experience with infrastructure as code, establishing SLIs/SLOs, software/systems engineering, production databases, automation and observability tooling, and operational best practices is required. Experience with Kubernetes is a plus.
Responsibilities
- Pair-programming to collaboratively improve the services that power Snyk
- Establishing SLIs and SLOs for the key customer workflows that your team owns
- Diagnosing the factors that most threaten SLOs and identifying necessary improvements
- Improving observability, measurement and diagnostics for key customer SLIs and SLOs
- Creating and fine-tuning error budgets, with dashboards and alerts to monitor them
- Reducing time to recover through shorter deployment lead times
- Improving application design to partition workloads by customer criticality
- Sharing the practices and tooling you develop across other engineering teams
- Implementing capacity management and load testing capabilities for core services
- Working with teams to ensure that monitoring and alerting are instrumented to focus on customer impact. The goal is that no one should get out of bed at 3 a.m. for non-customer-facing issues
- Raising the bar on production readiness and on incident response and analysis, and working with R&D teams to meet this bar
- Participating in our on-call rotation (compensated)
Requirements
- Enjoy working as part of a team and teaching others
- Experience with infrastructure as code
- Familiarity with establishing SLIs/SLOs, error budgets, and metrics on a variety of user flows
- Experience with software engineering and systems engineering
- Experience working with production databases
- Experience reducing toil for internal teams by building and maintaining automation and observability tooling
- Experience with operational best practices, including incident response and analysis
- Experience running and operating software on Kubernetes
Benefits
- Flexible working hours
- Work-from-home allowances
- In-office perks
- Time off for learning and self-development
- Generous vacation and wellness time off
- Country-specific holidays
- 100% paid parental leave for all caregivers
- Health benefits
- Employee assistance plans
- Annual wellness allowance
- Country-specific life insurance
- Disability benefits
- Retirement/pension programs
- Mobile phone allowance
- Education allowance