Sr. Manager, Site Reliability Engineering (Hybrid - Seattle, WA)

Nordstrom•Seattle, WA

4d•Hybrid

About The Position

We’re looking for a strategic and hands-on Sr. Manager of Site Reliability Engineering to lead reliability at scale for one of retail’s most complex engineering platforms. You’ll lead a team of talented engineers serving ~2,500 internal developer customers, championing automation and operational excellence to ensure our platform infrastructure enables engineering velocity and business innovation.

Requirements

Experience - 5+ years in SRE, DevOps, or infrastructure engineering, with 4+ years in a leadership role, ideally managing multi-team or platform engineering organizations.
Technical Depth - Strong expertise in cloud platforms (AWS and GCP), container orchestration (Kubernetes, EKS), and CI/CD pipelines including supply chain security (container signing, SBOM, OPA policy validation).
Programming Skills - Proficiency in Python, Go, or Java.
Tool Mastery - Hands-on experience with OpenTelemetry, CloudWatch, Amazon Timestream, Splunk, Kubernetes, and Kafka.
Problem Solver - Strong analytical skills and a genuine passion for root cause analysis and continuous improvement.
Communicator - A clear, concise, and collaborative communicator who can translate technical complexity for executive audiences and work hands-on with engineers.
Education - Bachelor’s degree in Computer Science or Engineering, or equivalent experience.

Nice To Haves

Experience with large-scale distributed systems in a multi-cloud environment (AWS and GCP).
Experience with AI-assisted SRE operations: incident triage, anomaly detection, or AI-augmented on-call tooling.
Familiarity with developer platform SRE: internal developer platforms (IDPs), platform reliability metrics, and developer experience measurement.
Cloud certifications (e.g., AWS Solutions Architect, Google Cloud Professional Engineer).

Responsibilities

Lead & Inspire - Build and mentor a high-performing SRE team that takes pride in platform ownership. Foster a culture of growth, initiative, and continuous improvement.
Drive Reliability - Own the availability and performance of critical services through proactive monitoring, disciplined incident response, and thorough root cause analysis — catching problems before developers ever feel them.
Automate Everything - Drive meaningful reduction of manual toil through automation across deployment, recovery, and scaling processes — freeing your team to focus on higher-impact work.
Champion AI-Augmented Operations - Lead adoption of AI tooling across SRE workflows including automated incident triage, anomaly detection, and AI-assisted on-call response. Partner with the AI & ML Enablement team to build intelligent operational capabilities that give us a meaningful edge.
Monitor & Observe - Define and execute observability strategies across our stack using OpenTelemetry, CloudWatch, Amazon Timestream, Splunk, and other tools — building the telemetry foundation to detect and resolve issues before they impact developers.
Collaborate & Align - Build strong partnerships across engineering, product, and operations — translating reliability goals into business priorities and vice versa.
Plan for Scale - Lead capacity planning and performance tuning for services running on our multi-cloud Kubernetes platform spanning AWS EKS and GCP GKE, with HPA/VPA/KEDA autoscaling across clusters.
Measure & Improve - Establish and track SLOs, SLAs, and error budgets. Use them to drive continuous improvement in system reliability and team efficiency, and report progress regularly to executive leadership.