The Site Reliability Engineer will ensure the reliability, performance, and scalability of the Client System. The engineer will define and track Key Performance Indicators and Service Level Objectives, identify and resolve performance bottlenecks, and perform root cause analysis on incidents so that preventive measures can be implemented and system efficiency improved. The engineer will also design and implement monitoring and alerting systems that provide visibility into system health and performance, develop and maintain runbooks and playbooks for operational procedures, and automate routine operational tasks to reduce human error. Further responsibilities include conducting capacity planning to ensure the system can handle expected loads, running load tests to validate system performance under stress, and establishing disaster recovery procedures to minimize downtime. The engineer will participate in on-call rotations to respond to system incidents, collaborate with development teams to improve application reliability and performance, and apply chaos engineering practices to identify weaknesses before they impact users.
Job Type
Full-time
Career Level
Senior