Infrastructure Reliability Engineer

IKO North America, Mississauga, ON
$110,000 - $129,000, Onsite

About The Position

The IT Infrastructure Reliability Engineer plays a critical role in ensuring the availability, performance, and resilience of enterprise technology systems across a complex, globally distributed environment. Reporting to the Global Director of Infrastructure and Operations, this individual will serve as a subject matter expert in observability, monitoring, alerting, and application performance, while actively contributing to governance and architectural decisions through membership on the Architecture Review Board.

Requirements

  • 5+ years of experience in IT infrastructure, site reliability engineering (SRE), or a related operations role.
  • Demonstrated expertise in monitoring and observability platforms (e.g., Datadog, Prometheus, Grafana, Dynatrace, New Relic, or Splunk).
  • Solid understanding of APM concepts and hands-on experience instrumenting applications in enterprise environments.
  • Experience with ITSM and change management processes (ITIL certification preferred).
  • Proficiency with cloud platforms (AWS, Azure, GCP, OCI) and hybrid infrastructure architectures.
  • Familiarity with containerization and orchestration technologies (Docker, Kubernetes).
  • Experience with scripting or automation languages (Python, PowerShell…) and infrastructure-as-code tools (Ansible, Terraform).
  • Strong communication skills with the ability to convey complex technical information to both technical and non-technical audiences.

Nice To Haves

  • Experience in a formal Site Reliability Engineering (SRE) function with ownership of SLOs and error budgets.
  • Background in enterprise architecture governance or participation in architecture review processes.
  • Certifications such as AWS Solutions Architect, Google Professional Cloud Architect, ITIL v4, or CKA/CKAD.
  • Familiarity with observability frameworks such as OpenTelemetry.
  • Experience in regulated industries with compliance-driven change controls.

Responsibilities

  • Monitoring, Observability & Alerting: Design, implement, and maintain comprehensive monitoring solutions across on-premises, cloud, and hybrid infrastructure environments.
  • Develop observability frameworks leveraging metrics, logs, and distributed tracing to provide end-to-end visibility into system health and performance.
  • Define and manage alerting thresholds, escalation policies, and on-call runbooks to enable rapid incident detection and response.
  • Continuously evaluate and improve monitoring tooling (e.g., SolarWinds, Prometheus, Grafana, Splunk, Dynatrace) to align with organizational needs.
  • Establish SLOs, SLIs, and error budgets to measure and communicate reliability targets to business and technical stakeholders.
  • Lead the deployment and optimization of APM tools to monitor application response times, throughput, error rates, and resource utilization.
  • Collaborate with development teams to instrument applications where applicable and integrate performance monitoring into development pipelines.
  • Conduct proactive performance analysis to identify bottlenecks, regressions, and optimization opportunities before they impact end users.
  • Develop dashboards and reports that surface actionable insights for engineering, operations, and leadership teams.
  • Participate in post-incident reviews to identify root causes and drive improvements to application reliability and observability.
  • Serve as a technical liaison in the Change Advisory Board (CAB) process, evaluating infrastructure and platform changes for reliability risk.
  • Evaluate and improve change management standards, including pre-change testing, rollback planning, and post-change validation procedures.
  • Coordinate scheduled maintenance windows and communicate impact assessments to stakeholders and service owners.
  • Maintain change records and audit trails in the ITSM platform (ServiceNow) to support compliance and reporting.
  • Champion a culture of disciplined, risk-aware change practices across the I&O team.
  • Participate as a standing member of the Architecture Review Board, providing reliability, observability, and operational readiness input on proposed solutions.
  • Review and assess new infrastructure designs, cloud services, and technology platforms for alignment with reliability engineering standards.
  • Contribute to the development and maintenance of architecture principles, infrastructure reference architectures, and technology standards.
  • Work cross-functionally with Enterprise Architects, Security, and Development teams to ensure new capabilities are designed for operability and resilience.
  • Document ARB decisions and provide post-implementation feedback loops to inform future architectural guidance.
  • Develop and maintain infrastructure-as-code (IaC) for monitoring configurations, ensuring consistency and version control.
  • Support capacity planning efforts by analyzing trends in resource consumption and forecasting future infrastructure requirements.
  • Mentor junior engineers in reliability engineering principles, tooling, and best practices.
  • Contribute to the development of disaster recovery and business continuity plans, including regular DR testing.
  • Maintain up-to-date documentation for all monitoring, alerting, and operational runbooks.

Benefits

  • IKO recognizes that its success is due to the strength of its employees.
  • A primary goal of IKO is to promote each employee's sense of accomplishment and contribution so that employees enjoy their association with IKO.
  • The Company invests in its employees so that they are the most knowledgeable in the industry, and works hard to nurture loyalty and teamwork at IKO.
  • We are pleased to offer competitive compensation, health care, a progressive and challenging workplace and a commitment to teamwork and integrity.