SaaS Operations Team Lead

CSC•Buffalo Grove, IL

Remote

About The Position

Buffalo Grove, IL (Remote)
Monday to Friday, 8:00 am to 5:00 pm

We’re seeking a talented, motivated, hands-on Team Lead to run our SaaS Operations with a strong Site Reliability Engineering (SRE) mindset. You’ll own and improve the reliability of our SaaS platform, treating availability, performance, and operational excellence as core product features. This role is Azure-first and cloud-forward, and operates in a hybrid environment (Microsoft Azure plus private infrastructure).

Requirements

5+ years in production operations (SRE, platform engineering, DevOps, SaaS operations, systems engineering, or similar)
Demonstrated technical leadership (team lead responsibilities, mentoring, ownership of operational standards)
Strong troubleshooting across distributed systems: web platform, networking, containers, identity, certificates/secrets, and performance bottlenecks
Azure production experience with: Azure Front Door; Azure Container Apps; Azure virtual networking (VNets, private endpoints, DNS patterns, hybrid connectivity concepts); Azure Key Vault; and PaaS databases
Automation and scripting: PowerShell, Bash, Azure CLI, and YAML-based pipelines/workflows
DevOps toolchain experience (GitHub and/or Azure DevOps); automation/config tooling such as Ansible (or equivalent)
ITSM/process discipline and tools (e.g., ServiceNow): incident, problem, change management
Hybrid environment requirements: This position supports a hybrid platform. You must be able to operate and troubleshoot components running in private infrastructure, including enterprise identity systems (e.g., Active Directory, Group Policy); the web platform (IIS); Microsoft server-based platforms and related operational practices (patching/maintenance, certificate lifecycle, file services such as DFS); and virtualization/hypervisor platforms (Nutanix AHV, VMware, or similar)

Nice To Haves

Infrastructure as Code experience (Bicep preferred; Terraform/ARM also valuable)
Experience implementing SLOs and improving alerting hygiene (noise reduction, paging policies)
Experience improving incident response practices (runbooks, escalation paths, reliability reviews)

Responsibilities

Lead the SaaS Operations/SRE team: prioritize work, mentor engineers, set standards, and act as the primary escalation point
Own reliability outcomes: define and improve service health, availability, latency, and operational readiness
Operate and optimize Azure services including Azure Front Door, Azure Container Apps, virtual networking, PaaS databases, and Key Vault
Lead incident response end-to-end: triage, coordination, clear communications, and follow-through
Drive root cause analysis and postmortems; ensure corrective actions are implemented and tracked
Reduce operational toil through automation, self-service, and repeatable runbooks
Build and refine observability: monitoring, logging, dashboards, and actionable alerting
Manage day-to-day operational tickets and change activity following defined controls (incident/problem/change)
Partner with Engineering, Infrastructure, and Security to improve operability and safe delivery (release readiness, rollout/rollback planning)
Participate in an on-call rotation and planned maintenance windows after hours/weekends when needed