Senior Manager, Site Reliability Engineering (SRE) - Hybrid - Seattle

Nordstrom | Seattle, WA
7d | $191,000 - $297,000 | Hybrid

About The Position

We’re looking for a strategic and hands-on Senior Manager of Site Reliability Engineering to lead our SRE team in delivering resilient, scalable, and high-performing systems. This role is central to our mission of operational excellence and customer satisfaction. You’ll guide a team of talented engineers, champion automation, and collaborate across disciplines to ensure our infrastructure supports business growth and innovation.

Requirements

  • 5+ years in SRE, DevOps, or infrastructure engineering, with 2+ years in a leadership role.
  • Expertise in cloud platforms (especially AWS), container orchestration (Kubernetes, EKS), and CI/CD pipelines.
  • Proficiency in Python, Go, or Java.
  • Hands-on experience with New Relic, Splunk, and Kubernetes.
  • Strong analytical skills and a passion for root cause analysis and continuous improvement.
  • Clear, concise, and collaborative communicator who thrives in cross-functional environments.
  • Bachelor’s degree in Computer Science, Engineering, or equivalent experience.

Nice To Haves

  • Experience with large-scale distributed systems.
  • Familiarity with ITIL or similar incident management frameworks.
  • Cloud certifications (e.g., AWS Solutions Architect, Google Cloud Professional Engineer).

Responsibilities

  • Lead & Inspire: Build and mentor a high-performing SRE team. Foster a culture of ownership, innovation, and continuous learning.
  • Drive Reliability: Ensure the availability and performance of critical services through proactive monitoring, incident response, and root cause analysis.
  • Automate Everything: Reduce manual toil by implementing automation across deployment, recovery, and scaling processes.
  • Monitor & Observe: Define and execute observability strategies using New Relic, Splunk, and other tools to detect and resolve issues before they impact users.
  • Collaborate & Align: Partner with engineering, product, and operations teams to align reliability goals with business priorities.
  • Plan for Scale: Lead capacity planning and performance tuning for services running on AWS EKS and other cloud-native platforms.
  • Measure & Improve: Establish and track SLOs, SLAs, and error budgets. Continuously refine processes to improve system reliability and team efficiency.

Benefits

  • Medical, vision, and dental insurance options
  • Life and disability insurance options
  • Retirement benefits, including 401(k)
  • Paid time away, including PTO accruals and holidays
  • Merchandise discount and EAP resources