Principal DevOps Engineer

SciPlay
St. Cloud, MN

About The Position

SciPlay is a leading developer and publisher of digital games on mobile and web platforms, providing highly entertaining free-to-play games that millions of people play every day for their authenticity, engagement and fun. SciPlay currently offers nine core games, including social casino games Jackpot Party Casino, Gold Fish Casino Slots, Hot Shot Casino and Quick Hit Slots, and casual games MONOPOLY Slots, Bingo Showdown, 88 Fortunes Slots, Backgammon Live and Solitaire Pet Adventure. We have offices all over the world!

Position Summary

This role: We’re looking for someone with passion for and experience with building reliable, performant, secure and observable infrastructure for our live products. As a Senior Site Reliability Engineer on our Tech Ops team, you’ll operate our high-traffic systems and keep them running for our partner teams, and you’ll innovate to continually improve our legacy systems. You’ll identify vulnerabilities in our infra and work with the team to secure them. You’ll improve our observability so stakeholders can find everything they’re looking for in a single pane of glass. You’ll automate clickops to reduce toil for the team and streamline patching and security updates. And when there’s an incident, you’ll be there through the process, keeping our stakeholders informed while finding solutions now and mitigating the problem in the future.

Who you are: We’re looking for someone eager to innovate and challenge the status quo. You’re willing to take risks and own outcomes. You embrace working with others to overcome challenging problems. You speak up when you have ideas and welcome debate. You learn what you need to know, teach what you already know, and encourage the same in others. You set high standards and influence internal partners to build best-in-class platforms.

Who we are: We’re a team of huge nerds distributed around the world working on SciPlay’s modern platform for our games. Our Tech Ops team is just part of it. We thrive when we solve problems together and when we challenge each other. We strive to continually improve our skills and our culture, both in our team and throughout the company. We build empowered product teams with autonomy to make decisions and experiment to solve business problems, and we provide a cushion when things go sideways. It’s hard, and we love it.

What we’ll do together: We’ll solve cool problems. We’ll build new products. We’ll develop new features. The Tech Ops crew primarily works in Linux and AWS using a mix of Terraform, Ansible, Python and PHP, and manages Couchbase and MariaDB databases. Our CI/CD pipelines use Jenkins and ADO. Have some of that but not all of it? Our team will help you fill in the gaps. We care more about how you think and solve problems than about syntax. If we sound like the kind of people you want to work with, let’s talk.

Requirements

  • 7+ years of experience in a Site Reliability Engineer role or similar
  • Extensive experience with Linux systems admin and networking fundamentals
  • Strong experience working in a Live Services environment and proven ability to maintain highly available distributed systems.
  • Hands-on proficiency with the Grafana stack and strong observability fundamentals.
  • Proficiency in one or more programming/scripting languages, such as Python, Go or Ruby.
  • Experience automating patching and configuration management with tools such as Terraform, Ansible and Python scripting.
  • Experience using Service-Level Objectives to achieve technical success
  • Excellent troubleshooting, root cause analysis, and communication skills.

Nice To Haves

  • Experience with Couchbase, F5 load balancers and Jenkins CI/CD pipelines
  • Exposure to container orchestration (Docker/Kubernetes) and modern service mesh patterns

Responsibilities

  • Implement monitoring solutions to proactively identify issues and potential performance bottlenecks.
  • Set up and configure alerting systems to notify the team of anomalies and incidents.
  • Collaborate with software development teams to design systems that are reliable, scalable, and observable.
  • Share ownership of incident management and root cause analysis, identifying and addressing the underlying cause of incidents, and preventing their recurrence.
  • Communicate results out and up to keep stakeholders informed during and after incidents.
  • Lead security initiatives to keep our cloud and on-prem systems patched and secure without interrupting service to our customers.
  • Work with stakeholders to create patching policies and schedules to minimize disruptions to our development teams.
  • Build out and improve CI/CD pipelines
  • Participate in new initiatives such as building cloud services, new client technologies, containerizing legacy systems, etc.
  • Champion automation to reduce toil and increase development velocity
  • Define and instrument Service-Level Objectives that lead to a quality player experience
  • Maintain clear and up-to-date documentation for configurations, processes and best practices

What This Job Offers

Job Type: Full-time
Career Level: Senior
Education Level: No Education Listed
Number of Employees: 501-1,000 employees
