Digital Site Reliability Engineer (SRE) - Local to Columbus, Ohio

CGI•Columbus, OH

5d•Onsite

About The Position

This position is based onsite in an office setting in Columbus, OH. The role combines deep expertise in cloud technologies with a strong focus on reliability, scalability, and automation, ensuring that digital services are robust, efficient, and aligned with business objectives. The engineer will work cross-functionally with development, operations, and security teams to implement best practices and drive innovation in cloud infrastructure.

Requirements

Bachelor's degree in Computer Science, Engineering, or a related field.
3+ years of experience in cloud engineering, site reliability engineering, or DevOps, with hands-on expertise in GCP.
3+ years' experience in Infrastructure as Code (IaC) tools (e.g., Terraform, Deployment Manager).
3+ years' experience with monitoring, logging, and alerting tools (e.g., Prometheus, Grafana).
3+ years' experience designing and implementing CI/CD pipelines and automation workflows.
3+ years' experience troubleshooting and resolving problems, especially in distributed systems and cloud environments.
3+ years' experience working with SRE principles, including error budgets, SLIs/SLOs, and incident management.
3+ years' experience with cloud security best practices and regulatory compliance requirements.

Nice To Haves

Strong communication and collaboration skills, with the ability to work effectively in cross-functional teams
Ability to work independently and multitask within a collaborative work environment
Willingness and aptitude for continuous improvement
A "do the right thing" attitude while being a strong team player
Focus on customer service
GCP Professional certifications
Experience migrating workloads from on-premises or other cloud platforms to GCP.
Familiarity with Kubernetes, Docker, and container orchestration in GCP.
Experience with agile methodologies and project management tools

Responsibilities

Cloud Adoption Strategy: Collaborate with stakeholders to develop and execute strategies for adopting GCP services, including migration planning, architecture design, and implementation.
Reliability Engineering: Apply SRE principles to GCP environments, focusing on service reliability, availability, and scalability. Develop monitoring, alerting, and automation solutions to prevent outages and reduce manual intervention.
Cloud Infrastructure Management: Build, maintain, and optimize cloud infrastructure using Infrastructure as Code (IaC) tools such as Terraform or Deployment Manager.
Automation & CI/CD: Design and implement automated deployment pipelines and operational workflows to enable continuous integration and delivery of cloud-based applications.
Incident Management: Lead incident response for cloud-related issues, conduct root cause analysis, and implement corrective actions to improve system reliability.
Performance Optimization: Monitor system performance and proactively identify areas for improvement in cost, efficiency, and reliability.
Security & Compliance: Ensure cloud environments adhere to security best practices and compliance requirements. Collaborate with security teams to implement controls and monitor risk.
Documentation & Knowledge Sharing: Create and maintain technical documentation. Mentor and train team members on GCP adoption and SRE practices.