Brightstar Lottery - Cloud/Site Reliability Engineer (17859)

The City of Providence | Providence, RI
19d | $117,880 - $240,000 | Hybrid

About The Position

We are seeking a Cloud/Site Reliability Engineer to join our Cloud Infrastructure Engineering, Operations & Automation team. This role is designed for engineers who are passionate about building resilient systems, preventing incidents before they occur, and driving operational excellence through intelligent monitoring, AI-driven automation, and continuous improvement. You'll play a pivotal role in evolving our cloud-hosted environments to be more self-aware, self-healing, and scalable, and in ensuring high availability and performance of our applications and services. You will also investigate issues so that L3 product engineers can be engaged effectively during production incidents.

Requirements

  • Hands-on experience in cloud operations or site reliability engineering
  • Practical experience managing public cloud infrastructure and services (Azure or AWS knowledge preferred)
  • Proficiency in scripting and automation (Terraform, PowerShell, Python, Bash).
  • Experience with Infrastructure as Code (IaC) and GitOps principles
  • Hands-on experience with Kubernetes (K8s) and container orchestration
  • Expertise in monitoring tools (Dynatrace, Datadog, Prometheus, ELK).
  • Strong analytical, troubleshooting, and communication skills.

Nice To Haves

  • Experience applying agentic AI techniques to drive intelligent automation, optimize cloud services, accelerate troubleshooting and root-cause analysis, and enhance system resilience and recoverability.
  • Familiarity with AI/ML Ops or AI-assisted observability tools
  • Thorough understanding of Java application workloads and Java performance topics
  • Deep knowledge of one programming language (Java / Python / Go)
  • Strong Linux and networking skills
  • Understanding of software architecture patterns and app-dev principles
  • Public cloud certifications are a plus
  • Experience in a 24/7 operations environment.

Responsibilities

  • Design and refine monitoring strategies using tools like Dynatrace, Prometheus, and ELK.
  • Develop alerting standards that reduce noise and increase signal quality.
  • Continuously improve observability to detect anomalies before they impact users
  • Assess key metrics of application workloads for performance and reliability, alongside infrastructure and middleware monitoring
  • Identify issues in public/hybrid cloud services and resources.
  • Correlate alerts with telemetry and logs to identify systemic issues and improvement opportunities.
  • Work with L3 product engineers and cloud vendors to resolve cases
  • Design, build, and maintain robust automation pipelines using tools such as Terraform, Ansible, Jenkins, Helm, and Bash to streamline cloud operations.
  • Develop and implement self-healing capabilities that proactively detect and remediate issues, minimizing manual intervention and downtime.
  • Analyze operational workflows to identify repetitive tasks and transform them into scalable, automated solutions.
  • Collaborate with the Architecture team to enhance and enforce cloud baseline standards for consistency and reliability.
  • Automate incident response and recovery processes leveraging tools like PagerDuty to accelerate resolution and improve system resilience.
  • Apply advanced experience with both Azure and AWS cloud service providers.
  • Manage cloud infrastructure and services
  • Monitor and optimize cloud resource usage.
  • Open and manage Microsoft support tickets in collaboration with L3.
  • Participate in 24x7 On-Call rotation with after-hours support for critical incident response.

Benefits

  • Base pay is only one part of our Total Rewards program.
  • Sales roles may be eligible for commission payments, while other roles are eligible for discretionary bonuses.
  • In addition, we offer employees a 401(k) Savings Plan with Company contributions; health, dental, and vision insurance; life, accident, and disability insurance; tuition reimbursement; paid time off; wellness programs; and identity theft insurance.
  • Note: programs are subject to eligibility requirements.