About The Position

We are seeking an experienced (5+ years), motivated, and hands-on Cloud Platform DevOps Engineer to join our North American AI and DevOps Platform Engineering team. In this critical role, you will be responsible for enhancing the stability, reliability, and performance of our AI and DevOps platforms, which support a diverse ecosystem of AI applications, developer tools, and CI/CD pipeline technologies across the organization. You will actively contribute to infrastructure design, implementation, and maintenance, and facilitate agile development within the team.

The ideal candidate is a strong technical leader who champions agile practices, drives continuous improvement, and excels in both coding and coaching. You bring a deep understanding of infrastructure and operational considerations for Artificial Intelligence and Machine Learning initiatives, proven hands-on experience with DevOps tools and technologies such as Kubernetes, Docker, Helm, Ansible, and CI/CD platforms, and proficiency in scripting and automation (e.g., Python, Bash). We are looking for someone with a track record of implementing scalable, resilient, and high-performance solutions, strong communication and collaboration skills, and the ability to mentor and guide junior team members. You will join a dynamic team committed to fostering innovation and collaboration.

Requirements

  • Proven hands-on experience with HashiCorp Vault (installation, configuration, policy management, integrations).
  • Strong hands-on experience with at least two of PostgreSQL, Oracle, or MongoDB (installation, tuning, replication, backup/restore).
  • Hands-on experience deploying and managing Apache Airflow, including developing DAGs.
  • Solid hands-on experience with Kafka and/or IBM MQ (cluster setup, topic management, producer/consumer configuration).
  • In-depth hands-on experience with Kubernetes and Helm, including YAML configuration, troubleshooting Pods/Jobs/Deployments, and integration with secrets management solutions (CyberArk, HashiCorp Vault).
  • Practical experience with Kubernetes PVCs, Persistent Volumes, S3, and/or enterprise NAS solutions (e.g., SONiC NAS).
  • Strong hands-on experience with Prometheus, Grafana, and the ELK Stack (setup, dashboard creation, query optimization, alert configuration).
  • High proficiency in Python, Bash, or Go for automation, tooling development, and system administration.
  • Extensive hands-on experience with at least one major cloud provider (AWS, Azure, GCP).
  • Proficiency with IaC tools such as Terraform or Ansible.
  • Experience designing, implementing, and maintaining CI/CD pipelines (e.g., Jenkins, GitLab CI, GitHub Actions).
  • Experience with RESTful APIs and SOAP web services.
  • Proficiency with Gradle for build automation.
  • Understanding of the specific infrastructure requirements for deploying, managing, and scaling Artificial Intelligence and Machine Learning workloads (e.g., GPU resources, specialized storage, MLOps pipelines).
  • Awareness of data management strategies and data governance principles relevant to AI/ML models and training datasets.
  • Familiarity with metrics and monitoring approaches for the performance and health of AI/ML applications and their underlying infrastructure.
  • Proven experience acting as a Scrum Master within a technical team where you also performed significant hands-on engineering.
  • In-depth knowledge and practical application of Agile principles and the Scrum framework.
  • Excellent facilitation, coaching, and mentoring skills within a technical context.
  • Strong verbal and written communication skills, able to bridge technical and process discussions.
  • Ability to guide technical discussions, influence architectural decisions, and drive best practices.
  • Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field, or equivalent experience.

Nice To Haves

  • Certified ScrumMaster (CSM) or Professional Scrum Master (PSM) certification.
  • Relevant cloud certifications (e.g., AWS Certified DevOps Engineer, Azure DevOps Engineer Expert, GCP Professional Cloud DevOps Engineer).
  • Experience with site reliability engineering (SRE) principles and practices.
  • Familiarity with other Agile scaling frameworks (e.g., SAFe, LeSS).
  • Exposure to MLOps platforms or tools (e.g., Kubeflow, MLflow).

Responsibilities

  • Lead the design, implementation, and ongoing management of secure, scalable, and resilient infrastructure components.
  • Administer and maintain secret and certificate management solutions using HashiCorp Vault, including policy definition and integration.
  • Perform hands-on administration and optimization of database systems (PostgreSQL, Oracle, MongoDB), including performance tuning, backup, and recovery strategies.
  • Deploy, monitor, and troubleshoot data orchestration workflows using Apache Airflow, and develop/optimize DAGs.
  • Implement and manage messaging systems such as Kafka and IBM MQ, including cluster setup and configuration.
  • Develop, maintain, and troubleshoot RESTful API and SOAP web service integrations critical for system connectivity.
  • Implement and optimize build and deployment processes using Gradle.
  • Design, implement, and manage container orchestration platforms with Kubernetes and Helm, including integration with CyberArk and HashiCorp Vault for secrets management. Create, debug, and troubleshoot Kubernetes Pods, Jobs, and Deployments using YAML.
  • Configure and manage persistent storage solutions including Kubernetes PVCs, SONiC NAS, and S3, with an awareness of storage requirements for AI/ML workloads.
  • Set up and maintain load balancing solutions (e.g., Nginx, HAProxy, AWS ELB/ALB, Kubernetes Ingress controllers) for high availability and performance.
  • Implement, configure, and utilize comprehensive monitoring and logging solutions (Prometheus, Grafana, ELK Stack) to ensure system health and proactively identify issues, including those relevant to AI/ML applications.
  • Develop robust automation scripts and tools using Python, Bash, Go, or similar languages to streamline operations and enhance efficiency.
  • Participate actively in on-call rotations, responding to and resolving critical incidents with hands-on troubleshooting.
  • Create and maintain technical documentation, architecture diagrams, and runbooks for infrastructure components and processes.
  • Proactively identify and resolve technical impediments and process bottlenecks within the team and across organizational boundaries, paying special attention to unique challenges posed by AI/ML infrastructure.
  • Collaborate closely with stakeholders (e.g., product owners, technical leads) to ensure a well-defined and prioritized backlog for infrastructure work, technical debt, operational improvements, and AI/ML platform needs.
  • Drive continuous improvement in the team's agile and DevOps practices, helping them adapt and optimize their workflow for maximum efficiency and quality.