Senior Service Reliability Engineer

PlayStation Global•Aliso Viejo, CA

Hybrid

About The Position

PlayStation is a global leader in entertainment, producing products and services like PlayStation®5, PlayStation®4, PlayStation®VR, PlayStation®Plus, and acclaimed PlayStation software titles. The company strives to create an inclusive environment that empowers employees and embraces diversity. The PlayStation brand is part of Sony Interactive Entertainment, a wholly-owned subsidiary of Sony Group Corporation.

The Gaming, Developer and Future Technology Group (GDFT) within Sony Interactive Entertainment is leading the cloud gaming revolution, putting console-quality video games on any device. The Service Reliability Engineering (SRE) team plays a significant role in delivering a great cloud gaming experience by influencing design and operational decisions that affect the overall stability of the gaming service. SREs focus on overall ownership of production, production code quality, and deployments. The successful candidate will be self-directed and able to participate in decision-making at different levels. SREs are expected to have opinions on the state of the service and to provide critical feedback during different phases of the operational lifecycle, ensuring operational readiness and stability throughout the software development lifecycle.

Requirements

Minimum of 7 years of working experience in a Software Development and/or Linux Systems Administration role.
Strong interpersonal, written and verbal communication skills.
Available to be scheduled in an on-call rotation.
Proficient as a Linux Production Systems Engineer, with experience managing large-scale Web Services infrastructure.
Development experience in one or more of the following programming languages: Python (preferred), Bash, Go, Java, C++, or Rust.
Experience with at least 3 of the following topics: Distributed data storage at scale (Hadoop, Ceph); NoSQL at scale (MongoDB, Redis, Cassandra); Data Aggregation technologies (Elasticsearch, Kafka); Scaling and running traditional RDBMS (PostgreSQL, MySQL) with High Availability; Monitoring & Alerting (Prometheus, Grafana) and Incident Management toolsets; Kubernetes and/or AWS (deployment and management); Software Distribution (package management and distribution at scale); Configuration Management (Ansible, SaltStack, Puppet, Chef).

Nice To Haves

Software performance analysis and load testing (QA or SDET experience is a plus)

Responsibilities

Take a leadership role in ongoing improvements in Reliability and Scalability
Work closely with SRE Management to define KPIs and processes and to drive continuous improvement
Influence the architecture and implementation of solutions within the division
Mentor more junior SRE staff and set them up for success
Act as a voice to represent SRE in the wider organization
Represent the operational scalability of solutions in the wider division
Lead small-scale projects from inception to implementation
Design platform-wide solutions and provide technical leadership during their implementation
Demonstrate a high level of organizational skill and initiative in the role