Senior Site Reliability Engineer

SentinelOne • Washington, DC

About The Position

We are looking for a Senior Site Reliability Engineer (SRE) to join the Site Reliability Engineering team at SentinelOne. This organization’s mission is to keep our uptime promise to our customers by meeting our SLOs/SLAs, to help our engineering teams ship software to customers quickly and with quality, and to ensure our customers are successful. We are looking to add a Senior SRE with experience running incident post-mortems, automating repetitive operational tasks, improving alerting accuracy, and building and refining processes that reduce downtime. You will work closely with cross-functional teams to lead reliability initiatives and bring best practices to our team. We value strong written communication skills, data-driven decisions, and a keen eye for continuous improvement. You’ll help us simplify, bring a passion for new ideas, and know how to execute iteratively toward the final goal. We value candor and collaboration.

Requirements

5+ years of experience in Site Reliability Engineering, DevOps, or a related field in cloud-native environments
Strong expertise in incident management processes and the ability to lead complex troubleshooting efforts under pressure
Experience with Kubernetes and container orchestration
Experience with industry-standard observability stacks (Prometheus, Grafana, ELK, OpenTelemetry, etc.)
Proficiency in Python and Bash scripting to improve operational workflows and incident response
Familiarity with modern CI/CD pipelines and DevOps practices
Excellent communication skills with demonstrated ability to lead and mentor engineers in reliability practices

Responsibilities

Lead and execute incident management for production issues, ensuring rapid recovery and root cause analysis
Improve and optimize the observability strategy…
Collaborate with application engineering teams to design and implement monitoring solutions that enhance our alerting capabilities and reduce noise
Develop and refine SLOs, SLIs, and SLAs that align with business objectives and customer expectations
Conduct post-incident reviews, documenting findings and driving follow-up actions to prevent recurrence
Mentor and support other engineers in incident response, troubleshooting techniques, and reliability best practices

Benefits

Medical, Vision, Dental, 401(k), Commuter, Health and Dependent FSA
Unlimited PTO
Industry-leading gender-neutral parental leave
Paid company holidays
Paid sick time
Employee stock purchase program
Disability and life insurance
Employee assistance program
Gym membership reimbursement
Cell phone reimbursement
Numerous company-sponsored events including regular happy hours and team-building events