Principal Site Reliability Operations Engineer

Roblox • San Mateo, CA

About The Position

As a Senior Site Reliability Operations Engineer on the Reliability Team, you will manage production incidents and improve Roblox's incident processes while reporting to the Senior Operations Manager. You will maintain reliability service-level objectives, drive incidents tenaciously to resolution, and work with service teams towards appropriate action items during the incident postmortem process. If you are passionate about maintaining uptime in a complex distributed environment full of continuous change, you'll be right at home with our Reliability team. You will report to the Senior Manager, Reliability Response.

Requirements

At least 8 years of experience in a comparable role within a Site Reliability Team.
Advanced knowledge of systems and network infrastructure protocols.
Demonstrated ability in managing, troubleshooting, and resolving incidents in distributed environments.
Experience solving problems.
An ability to distill complex technical issues into clear and concise language.
Familiarity with at least one scripting or programming language to automate routine tasks (Python, Golang, or similar languages preferred).
Bachelor's degree or equivalent experience in Computer Science, Computer Engineering, or a similar technical field.
A great communicator; you are able to explain complex systems clearly to stakeholders and fellow engineers.
Able to operate in potentially ambiguous circumstances during a production incident.
Familiar with the interactions of services in a distributed system.
Tenacious towards driving challenging production incidents to resolution.

Responsibilities

Lead and manage production incidents.
Collaborate cross-functionally to troubleshoot and resolve sophisticated technical challenges.
Guide the implementation of incident management processes and procedures, ensuring fast and effective responses to minimize impact.
Continually monitor system health, performance and capacity, proactively addressing potential issues.
Conduct comprehensive post-mortem analysis to ascertain the root cause of incidents and formulate corrective measures.
Contribute substantially to the design and enhancement of system architecture to boost reliability and performance.
Leverage coding skills to automate daily routine tasks and enhance system efficiency.
Serve in the Incident Manager On-Call rotation.
Mentor junior team members.
