SRE/DevOps Engineer - 66764

Hitachi • Toronto, ON

About The Position

The L1 SRE is the first line of defense in monitoring, triaging, and executing standardized operational tasks for all enterprise applications running on standard patterns and platforms such as Kubernetes, APIs, WAF, databases, API proxies (Gloo, Apigee), Kafka, and cloud (AWS/Azure/GCP). They follow runbooks, leverage automation, and escalate appropriately to minimize downtime.

Requirements

System & Infrastructure Monitoring: Ability to use monitoring dashboards (e.g., Grafana, Datadog, Splunk, Argos, AIOps) to identify anomalies, follow alert workflows, and escalate when thresholds are breached.
Runbook Execution: Strictly follow documented steps to resolve standard incidents, and escalate when steps do not apply or fail.
Incident Triage & Communication: Perform first-line triage of alerts, gather logs/metrics, categorize severity, and notify stakeholders in clear, concise language.
Kubernetes Operations Knowledge (cloud or on-prem): Ability to check pod status, understand logs, and verify service endpoints using kubectl and monitoring tools.
Scripting (Python, Bash, PowerShell): Able to read and make small edits to scripts to automate repetitive checks.
Networking & Security Awareness: Understand basic network troubleshooting tools (ping, netstat, curl, traceroute) and know when issues may be related to a firewall, WAF, or proxy.
Documentation & Knowledge Capture: Accurately record steps taken during incidents, and suggest runbook updates where gaps exist.
2–5 years in an IT operations, NOC, or SRE/DevOps engineering role.
Kubernetes 101, Linux 101, Networking 101
Understanding of cloud-ready applications
Understanding of observability tools (Prometheus, Grafana, ELK, Splunk, etc.).
Strong troubleshooting mindset and ability to follow structured workflows, e.g., 5 Whys and fishbone diagrams.

Nice To Haves

Cloud Platform Familiarity (AWS, Azure, GCP): Understand basics of cloud services (VMs, load balancers, storage) and how to navigate a cloud console.
Database Basics (SQL/NoSQL): Run simple queries to validate DB connectivity and health.
Automation & Self-Service Mindset: Identify repetitive manual steps and propose candidates for automation.
Exposure to Incident Management Tools (xMatters, ServiceNow, Jira, etc.): Comfortable working within ITSM/incident workflows.
AI/Chatbot-Assisted Ops (emerging skill): Use AI assistants to search runbooks or suggest remediation steps.

Responsibilities

Monitor system health, alerts, dashboards, and logs across cloud and on-prem infrastructure.
Isolate whether a functional issue lies with the application or the platform.
Execute standardized runbooks for incident resolution, deployments, and routine tasks.
Perform initial triage of incidents and escalate to L2/L2+ as needed to mitigate the issue or reach a bypass.
Document new issues, gaps in runbooks, and automation opportunities.
Provide excellent communication to stakeholders during incidents.
Support onboarding of new applications into the operations framework.

Benefits

Industry-leading benefits, support, and services that look after your holistic health and wellbeing.
Flexible arrangements that work for you (role and location dependent).
