Staff Cloud Engineer - Site Reliability Engineering

Kinaxis Inc.•Remote

22h•Hybrid

About The Position

The Site Reliability Engineering (SRE) team owns the delivery, operation, and monitoring of Kinaxis products and cloud infrastructure, ensuring services remain reliable, performant, and available for global customers 24/7. We build and operate automation at scale using infrastructure-as-code, deployment pipelines, and platform tooling such as Terraform, GitHub Actions, ArgoCD, and Ansible. We’re seeking a Staff Cloud Engineer to design, build, and operate resilient systems across cloud environments. You will work closely with Product, Security, Platform, and Support teams while driving an automation-first approach, including building and evolving Hands-on Lab environments to improve consistency and scalability. In this hands-on role, you’ll lead technical initiatives, support incident response, and continuously enhance platform reliability and performance.

Requirements

Bachelor’s or Master’s degree in Engineering with a specialization in Computer Science or a related discipline, or demonstrated equivalent experience.
Prior experience in development and operational roles.
5+ years of experience managing production systems in GCP, Azure or AWS.
5+ years of experience developing managed solutions: configuration management (Ansible), infrastructure management (Terraform), and CI/CD solutions (ArgoCD).
5+ years of experience developing in scripting languages: Python, PowerShell, Bash.
Knowledge of containers and orchestrators: Kubernetes, Helm.
Ability to share knowledge with others in the company.
Proactive communication and documentation skills.
Ability to work both independently and as part of a team.

Responsibilities

Build and operate production systems that meet SLA targets and support customer workloads at scale.
Develop and maintain automation across multiple public clouds (Azure, GCP) and datacenters, using infrastructure-as-code (Terraform), deployment pipelines (GitHub Actions and Terraform), configuration management (Ansible), and scripting languages (Python, PowerShell, Bash).
Own the lifecycle of production and hands-on lab environments, including deployment, resource management, incident triage, and troubleshooting.
Participate in the on-call rotation, resolving incidents and driving root cause analysis and long-term fixes.
Work closely with Product, Platform, and Support teams and take ownership of complex operational problems, improving how the platform runs over time.
Provide technical guidance and mentorship to other engineers, helping improve how the team designs, builds, and operates systems.