Senior DevOps Engineer

Carnegie Learning Canada•St. John's, NL

CA$95,000 - CA$125,000•Remote

About The Position

We are looking for a Senior DevOps Engineer to help ensure that the systems our students and teachers rely on every day are reliable, secure, scalable, observable, and high performing. This is a hands-on engineering role for someone who enjoys improving systems, reducing toil through automation, enabling developers, and strengthening operational excellence. You will work across CI/CD, infrastructure as code, observability, incident response, and cloud reliability, while helping modernize legacy practices and improve the developer experience.

You should be a strong communicator who collaborates well in a distributed environment and is comfortable partnering across engineering, QA, support, and product teams. We value people who are practical, curious, accountable, and motivated by continuous improvement. Our environment includes AWS, Jenkins, CloudFormation, ECS/Fargate, GitHub, Jira, Splunk, New Relic, Cortex.io, Slack, Snowflake, Databricks, and other modern engineering tools and platforms.

Location is flexible! This role can work from our St. John's, NL office on the water, or remote anywhere within Canada. Candidates must already be residing in Canada. No visa sponsorship is available.

Requirements

5+ years of experience building and operating production-grade cloud solutions, preferably in AWS
Cloud certification beyond Practitioner level, such as SysOps, DevOps, Solutions Architect, or Security
Strong hands-on experience with Jenkins, including Jenkins DSL, plugin ecosystem, CI/CD pipelines, and Git-based workflows
Strong scripting and automation skills, including Groovy and at least one additional language such as Python, Go, Java, or Bash
Experience with web applications and modern frameworks/languages such as JavaScript, TypeScript, Angular, Node.js, Django, or Laravel
Strong troubleshooting skills across the SDLC, including failed builds, pipeline issues, and infrastructure bottlenecks
Experience designing and implementing secure AWS infrastructure using Infrastructure as Code, preferably CloudFormation
Hands-on experience with Docker, containers, and container orchestration, especially ECS Fargate
Experience with high availability, load balancing, and content delivery platforms and practices
Strong cloud security and networking experience, including least-privilege access models, IAM policies, and secure infrastructure design
Experience with observability, logging, and performance monitoring tools for troubleshooting and capacity planning, preferably Splunk and New Relic
Experience with production change management, including rollback planning and documentation
Strong communication, presentation, and customer service skills, with the ability to work independently and solve complex technical problems

Nice To Haves

Experience leveraging AI-powered tools or platforms to improve operational efficiency, troubleshooting, automation, developer experience, or service reliability
Database knowledge (writing queries, troubleshooting, performance, and monitoring)
DevSecOps experience, including integrating code analysis and vulnerability scanning tools into the CI/CD pipeline, and familiarity with cybersecurity and regulatory frameworks (e.g., NIST, SOC 2, ISO, and COBIT)

Responsibilities

Develop and maintain Jenkins shared libraries and Jenkins pipelines using Groovy
Improve build, test, and deployment workflows to make software delivery more reliable and efficient
Partner with development and QA teams to support internal development and test environments
Help teams adopt better engineering practices around release quality, automation, and deployment confidence
Build and manage AWS infrastructure using Infrastructure as Code, primarily CloudFormation
Design, deploy, and improve secure, scalable cloud environments
Troubleshoot infrastructure and platform issues independently and drive long-term fixes
Help modernize legacy tooling and operational practices
Design and implement monitoring, alerting, trend analysis, and self-healing capabilities
Support SLIs and SLOs and help teams use reliability metrics to improve service health
Monitor and respond to alerts and production issues across applications and infrastructure
Participate in incident response and post-incident reviews, identifying both technical and process improvements
Assist support and engineering teams with log analysis, troubleshooting, and root cause investigation
Work effectively in a remote-first environment using tools like Slack, Jira, and shared documentation
Keep tasks, documentation, and operational runbooks current
Communicate clearly during both planned technical sessions and real-time incident situations
Contribute to a strong culture of teamwork, accountability, and customer focus
Participate in a monthly on-call rotation (after-hours work is light, but some is needed on occasion)

Benefits

Cost-shared health and dental benefits plan
Competitive Retirement Savings Matching Program to plan for your future
Flexible work arrangements with our Work From Anywhere Policy
Your Time, Your Way - paid time off that you can use as you see fit to recharge and nurture your personal life
Top-Up Parental Leave
Reduced working hours on full pay for new parents
Free access to CL products for employees and their children
Quarterly Wellness Incentives
Monthly employee activities + recognition program
Employee Allyship Groups (EAGs)