Senior Site Reliability Engineer, Incident Response
Box · Posted: April 26, 2023 · Other
About the position
Box is seeking a Global Senior Site Reliability Engineer to lead their Global Technical Operations and ensure the continuous health, availability, and reliability of their platforms and SaaS offerings. The role involves managing live-site incidents, coordinating with cross-functional teams, and implementing improvements to enhance site and service manageability. The ideal candidate should have extensive experience in production/platform operations, strong technical expertise in Linux systems and networking, and familiarity with ITILv4 Service Lifecycle Management.
Responsibilities
- Own and direct live-site Major Incident Management
- Triage, refine, and verify the Problem Statement
- Notify and coordinate the efforts of appropriate SME resources
- Lead cross-functional Incident Bridges
- Ensure accurate and timely communication to key stakeholders and business entities
- Lead daily Incident and Change ticket reviews
- Coordinate and monitor change windows
- Coordinate with Problem Management on TopOps Issues and action items
- Protect customers, their data, and the availability of all Box services
- Troubleshoot and identify critical problems in a global hybrid cloud architecture
- Provide technical expertise and experience to address issues in 24x7 environments
- Lead daily reviews of planned changes
- Ensure complete and correct documentation of customer-impacting Incident tickets
- Contribute to and review Incident postmortems
- Participate in Problem Management scrums and Postmortems
- Lead projects to improve tools and processes related to site and service manageability
- Coordinate regularly with Infosec, Customer Success, Platform, and Dev leaders
- Mentor and train Global NOC and system engineers
- Have large-scale production/platform operations experience
- Be competent in debugging global, distributed Web/API sites
- Have a solid understanding of ITILv4 Service Lifecycle Management and Incident, Change, and Problem Management framework
Requirements
- 5+ years of large-scale production/platform operations experience in large SaaS provider environments, preferably as a Major Incident Manager, SRE team leader, or Infrastructure (IaaS) or Platform (PaaS) Architecture SME in a Managed Service Provider environment.
- Experience in bare metal, OpenStack, and Kubernetes (K8s) architectures supporting a large number of SOA/API-based services.
- Exposure to open source service meshes, proxies, caching, message buses (Kafka, MQS), NoSQL (HBase, Hadoop), MySQL clusters, and search environments (Solr, ES).
- Competence in debugging global, distributed Web/API sites based on Linux systems (Ubuntu, RHEL, CentOS), BGP, iBGP, and IP Anycast networking in multi-vendor virtualized, edge, and hybrid public cloud architectures.
- Familiarity with common terminologies, processes, and architectures in Linux Open Source environments, as well as a thorough understanding of Virtualization, Containers, and Kubernetes.
- Strong communication and interaction skills with individuals at all levels, from individual contributors to C-level executives, across multiple countries, ethnicities, and backgrounds.
- Command presence and ability to remain calm and collected in highly stressful situations, such as a major service outage.
- Willingness to continuously learn new skills and technologies.
- Bachelor's degree in Computer Science, Information Systems, or an equivalent technical field, or similar work experience in a large-scale 24/7 production environment supporting critical, real-time applications.
- Flexibility to work different shifts and provide weekend coverage as needed.
- Solid understanding of ITILv4 Service Lifecycle Management, Service Delivery KPIs, SLIs, and SLOs, as well as Incident, Change, and Problem Management frameworks, terminology, and tools (ServiceNow, Remedy).
Benefits
- Pension
- Medical and dental coverage
- Robust wellness program
- 25 days of vacation (plus birthday off)
- Subsidized gym membership
- Free lunch and snacks
- Impressive office location
- Equal opportunity employer
- Respect for diversity and inclusion
- Accommodations available for people with disabilities
- Protection of personal information during application process