Site Reliability Engineer

Cutover

124d • $120,000 - $130,000

About The Position

Cutover provides enterprise technology operations teams with an AI-powered SaaS solution that automates and streamlines complex processes with intelligent runbooks. The Cutover solution enables teams to respond to incidents quickly, recover from IT outages, and manage cloud migrations with precision and efficiency. Cutover is used in many of the world's largest financial institutions to support their critical technology operations, including 5 of the 6 largest asset managers and 3 of the 5 largest US banks.

We’re looking for a Site Reliability Engineer (SRE) to join our US team. This role will report to our SRE Lead. Cutover’s SRE team is responsible for ensuring the reliability and performance of our production systems and applications. As a team, we’re committed to continually improving our engineering culture and to maintaining a balance between risk and reliability.

Requirements

A genuine excitement for complex problem solving within our tech stack, applying what you know to our unique problems
Familiarity with at least one scripting language such as Ruby, JavaScript, Python, or Bash
Experience with containerization (e.g. Docker) or IaC (e.g. Terraform, Helm, CloudFormation)
An eagerness to follow modern engineering practices and learn from others
Familiarity with observability tools such as DataDog, New Relic, Grafana, Prometheus, ELK, or OpenTelemetry
Understanding of core networking concepts (DNS, HTTP/S, Load Balancing, etc.)
A collaborative mindset with clear communication skills
A willingness to ask questions to gain a better understanding of new or complex concepts

Nice To Haves

Exposure to major incident response processes
AWS Certified Cloud Practitioner or hands-on experience with cloud environments

Responsibilities

Respond to incidents and alerts, triaging urgency and investigating root cause
Contribute regularly to our documentation on system design, troubleshooting, best practices, and engineering processes
Contribute to post-mortems and help identify long-term improvements under guidance
Support cross-functional teams during investigations and post-incident reviews
Support and enhance observability tools and techniques by identifying metrics, logging, and alerting improvements
Write and execute simple automation scripts (e.g. Python, Ruby, Bash) to improve reliability and reduce toil
Work on internal tools, pipelines, and IaC solutions to help improve the speed of software delivery and recovery
Work on efforts to enhance the reliability and performance of our applications and systems, maximizing uptime and minimizing disruptions
Work closely with the development and platform engineering teams to optimize the infrastructure on AWS, ensuring scalability and efficiency

Benefits

Share Options as part of our compensation package
20 days of PTO per year + public holidays
3 volunteer days to use for any charitable/voluntary cause
A top-tier private health insurance package
401k contribution plan
Work from home stipend
A personal learning and development budget through Learnerbly
Globally consistent parental leave approach
Employee Referral Scheme
Multiple Cutover mental health initiatives, including fully subsidised therapy sessions

What This Job Offers

Job Type

Full-time

Career Level

Mid Level

Number of Employees

101-250 employees
