Senior Software Engineer (Platform Data Reliability & Automation)

PlayStation Global•San Diego, CA

$177,300 - $265,900•Hybrid

About The Position

Sony Interactive Entertainment (SIE) is seeking a Senior Software Engineer focused on Platform Data Reliability and Automation to join its world-class engineering team. This role is crucial for building, automating, and operating scalable data platforms, with a strong emphasis on Infrastructure as Code (IaC) and cloud technologies. The position focuses on the reliability and automation of NoSQL, Streaming, and Caching services across AWS and GCP environments. The engineer will design robust automation frameworks, ensure high availability, and collaborate with product and platform teams to deliver resilient infrastructure supporting billions of transactions and millions of players globally. By embracing Development and Database Reliability Engineering (DBRE) principles, driving automation-first practices, and applying AI/ML where applicable, the role aims to improve system uptime, reduce manual toil, and increase velocity for engineering teams across PlayStation. These contributions will directly support the reliability, scalability, and operational excellence of the data platform powering millions of players worldwide.

Requirements

Bachelor's or Master's degree in Computer Science or a related field.
6+ years of software development and DBRE experience, including at least 3 years focused on Go and Infrastructure as Code with an emphasis on automation.
Deep proficiency in Go (Golang), with the ability to write performant, idiomatic, and maintainable code for production-scale systems.
Proven experience designing modular, domain-driven architectures in Go, supporting large and complex backend services.
Expertise with infrastructure-as-code tools such as Terraform and Ansible.
Deep expertise operating large-scale NoSQL, caching, and streaming platforms (Apache Kafka, Redis, AWS MSK, etc.), including tuning, compaction strategies, repair operations, backup/recovery, and performance optimization.
Solid understanding of Linux internals, networking, and storage systems.
Experience building, deploying, and operating stateful workloads on Kubernetes, including automation and lifecycle management of database and streaming platforms.
Hands-on experience with AWS and/or GCP, including managed services such as MSK, DynamoDB, ElastiCache, or equivalent technologies.
Strong problem-solving and analytical skills, with a passion for automation and distributed systems reliability.
Excellent communication and collaboration skills, with experience mentoring and influencing peers across diverse teams.
Experience building internal developer platforms, self-service infrastructure, or platform engineering solutions that improve developer productivity and operational efficiency.

Nice To Haves

Prior use of Go for infrastructure automation, control plane services, or SRE-focused tooling is a plus.
Experience leveraging AI/ML and Generative AI technologies to improve infrastructure automation, operational workflows, incident management, observability, or developer productivity is a huge plus.
Certification in relevant technologies (e.g., AWS Certified Database - Specialty) is a plus.

Responsibilities

Design and implement Infrastructure as Code (IaC) and automate the provisioning, monitoring, scaling, and lifecycle management of NoSQL, Streaming, and Caching platforms (e.g., Cassandra, Aerospike, Kafka, Redis).
Drive end-to-end automation to enable repeatable, reliable, and self-service deployment of data services across cloud and hybrid environments.
Ensure high availability, scalability, and resiliency of the platform data solutions.
Define and enforce SLIs, SLOs, and error budgets for data platforms to drive reliability engineering practices.
Build highly performant, self-healing systems with automated failover and autoscaling for databases and streaming platforms.
Develop observability solutions (metrics, logging, tracing) for Cassandra, Aerospike, Redis, and Kafka/MSK to ensure proactive issue detection.
Partner with engineering and platform teams to provide reliable, scalable, and performant data services.
Lead incident response for critical database/caching/streaming issues and drive root cause analysis with permanent automated fixes.
Explore and apply AI-driven approaches to automation (e.g., anomaly detection, predictive scaling, automated remediation) to enhance operational efficiency.
Drive and implement best practices, procedures, and operational playbooks to facilitate knowledge sharing and support continuous improvement across global teams.
Mentor junior engineers and influence best practices in automation, distributed systems, and database reliability.