Principal DevOps Engineer

SciPlay • St. Cloud, MN


About The Position

SciPlay is a leading developer and publisher of digital games on mobile and web platforms, providing highly entertaining free-to-play games that millions of people play every day for their authenticity, engagement and fun. SciPlay currently offers nine core games, including social casino games Jackpot Party Casino, Gold Fish Casino Slots, Hot Shot Casino and Quick Hit Slots, and casual games MONOPOLY Slots, Bingo Showdown, 88 Fortunes Slots, Backgammon Live and Solitaire Pet Adventure. We have offices all over the world!

Position Summary

This role: We’re looking for someone with passion for and experience with building reliable, performant, secure and observable infrastructure for our live products. As a Senior Site Reliability Engineer on our Tech Ops team, you’ll operate our high-traffic systems, keep them running for our partner teams, and innovate to continually improve our legacy systems. You’ll identify vulnerabilities in our infra and work with the team to secure them. You’ll improve our observability so stakeholders can find everything they’re looking for in a single pane of glass. You’ll automate clickops to reduce toil for the team and streamline patching and security updates. And when there’s an incident, you’ll be there through the process, keeping our stakeholders informed while finding solutions now and mitigating the problem in the future.

Who you are: We’re looking for someone eager to innovate and challenge the status quo. You’re willing to take risks and own outcomes. You embrace working with others to overcome challenging problems. You speak up when you have ideas and welcome debate. You learn what you need to know, teach what you already know, and encourage the same in others. You set high standards and influence internal partners to build best-in-class platforms.

Who we are: We’re a team of huge nerds distributed around the world working on SciPlay’s modern platform for our games; our Tech Ops team is just part of it. We thrive when we solve problems together and when we challenge each other. We strive to continually improve our skills and our culture, both in our team and throughout the company. We build empowered product teams with autonomy to make decisions and experiment to solve business problems, and we provide a cushion when things go sideways. It’s hard, and we love it.

What we’ll do together: We’ll solve cool problems. We’ll build new products. We’ll develop new features. The Tech Ops crew primarily works in Linux and AWS using a mix of Terraform, Ansible, Python and PHP, and manages Couchbase and MariaDB databases. Our CI/CD pipelines use Jenkins and ADO. Have some of that but not all of it? Our team will help you fill in the gaps. We care more about how you think and solve problems than syntax. If we sound like the kind of people you want to work with, let’s talk.

Requirements

7+ years of experience in a Site Reliability Engineer role or similar
Extensive experience with Linux systems admin and networking fundamentals
Strong experience working in a Live Services environment and proven ability to maintain highly available distributed systems.
Hands-on proficiency with the Grafana stack and strong observability fundamentals.
Proficiency in one or more programming/scripting languages, such as Python, Go or Ruby.
Experience automating patching and configuration management with tools such as Terraform, Ansible and Python scripting.
Experience using Service-Level Objectives to achieve technical success
Excellent troubleshooting, root cause analysis, and communication skills.

Nice To Haves

Experience with Couchbase, F5 load balancers and Jenkins CI/CD pipelines
Exposure to container orchestration (Docker/Kubernetes) and modern service mesh patterns

Responsibilities

Implement monitoring solutions to proactively identify issues and potential performance bottlenecks.
Set up and configure alerting systems to notify the team of anomalies and incidents.
Collaborate with software development teams to design systems that are reliable, scalable, and observable.
Share ownership of incident management and root cause analysis, identifying and addressing the underlying cause of incidents, and preventing their recurrence.
Communicate results out and up to keep stakeholders informed during and after incidents.
Lead security initiatives to keep our cloud and on-prem systems patched and secure without interrupting service to our customers.
Work with stakeholders to create patching policies and schedules to minimize disruptions to our development teams.
Build out and improve CI/CD pipelines
Participate in new initiatives such as building cloud services, new client technologies, containerizing legacy systems, etc.
Champion automation to reduce toil and increase development velocity
Define and instrument Service-Level Objectives that lead to a quality player experience
Maintain clear and up-to-date documentation for configurations, processes and best practices