Site Reliability Engineer / DevOps Engineer

Global Payments Inc. • Alpharetta, GA

About The Position

Every day, Global Payments makes it possible for millions of people to move money between buyers and sellers using our payments solutions for credit, debit, prepaid and merchant services. Our worldwide team helps over 3 million companies, more than 1,300 financial institutions and over 600 million cardholders grow with confidence and achieve amazing results. We are driven by our passion for success and we are proud to deliver best-in-class payment technology and software solutions. Join our dynamic team and make your mark on the payments technology landscape of tomorrow.

Summary of This Role

Responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. Creates a bridge between development and operations by applying a software engineering mindset to system administration topics. Splits time between operations/on-call duties and developing systems and software that help increase site reliability and performance.

Requirements

BS in Computer Science, Information Technology, Business / Management Information Systems or related field (preferred)
Typically a minimum of 2 years of relevant experience
Knowledge of SDLCs, scripting languages (Go, PowerShell, Bash, etc.), building and managing GCP environments, automating autonomous solutions, change management, incident management, and agile methodologies.
Experience with GCP, Jenkins, Terraform, Ansible, OpenShift, and Kubernetes.
Experience with monitoring platforms including GCP, Datadog, ThousandEyes, LogicMonitor, Logstash, Splunk, Looker Studio, etc.
Applicants MUST be authorized to work in the U.S.
We are UNABLE to sponsor or take over sponsorship of an employment visa or student visa at this time.

Nice To Haves

Experience with GCP, Terraform, Kubernetes, pipeline management, and monitoring platforms such as ThousandEyes, Splunk, Logstash, Kibana, and Looker Studio is a major plus.

Responsibilities

Pushing our systems to their limits, and then coming up with designs for how to get them to the next performance tier.
Using practices from DevOps and GitOps to improve automation and processes and make self-service possible.
Safeguarding reliability. Ensuring that our services are highly available, resilient against disasters, self-monitoring, and self-healing.
Running “game days” to test assumptions about reliability and learn what will break before it matters to customers.
Reviewing designs with an eye toward increasing the holistic stability of our platform and identifying potential risks.
Building systems to proactively monitor the health, performance and security of our production and non-production virtualized infrastructure.
Improving our monitoring and alerting systems to make sure engineers get paged when it matters (and don’t get paged when it doesn’t).
Troubleshooting systems and network issues, alongside our Technical Operations Team.
Evolving our SDLC, practices, and tooling to account for Site Reliability considerations and best practices.
Developing runbooks and improving documentation.
Chaos engineering: you’re expected to think laterally about how our systems might fail in theory, design tests to demonstrate how they behave in practice, and then formulate and implement remediation plans as appropriate.