Nordstrom Technology seeks an exceptional Site Reliability Engineer with deep networking expertise to join our Nordstrom Operations Center (NOC) team. You'll maintain "eyes on glass" monitoring of application services and critical network infrastructure, ensuring the health and reliability of Nordstrom's retail operations. This role combines proactive monitoring, incident response, and root cause analysis with advanced network troubleshooting: diagnosing complex issues spanning the full stack and driving resolution of P1/P2 incidents that impact business operations.

This is an onsite role in Seattle, WA, supporting Nordstrom's 24/7 NOC. Candidates must be available to work in office at the Nordstrom corporate headquarters 5 days/week, with shifts starting at 6:00 AM PST, including one weekend day per week (Saturday or Sunday) as part of the regular rotation.

A day in the life...

Monitoring & Incident Response:
Maintain real-time monitoring across application services, network infrastructure, and business KPIs (site visitors, order flow, revenue-impacting metrics)
Participate in 24/7 on-call rotations, responding to PagerDuty alerts and managing incidents through ServiceNow workflows
Lead P1/P2 incident troubleshooting, coordinating with engineering teams and vendors to restore service rapidly
Perform real-time network diagnostics and performance testing during active incidents

Network Operations:
Monitor and troubleshoot routers, switches, firewalls, load balancers, wireless systems, and SD-WAN solutions
Analyze network performance, identify bottlenecks, and recommend optimization strategies
Investigate connectivity issues, VLAN configurations, routing problems, and security events
Coordinate with network engineering during changes, maintenance windows, and infrastructure upgrades
Maintain visibility into multi-vendor cloud environments (AWS, Azure) and cloud networking architectures

Root Cause Analysis & Continuous Improvement:
Conduct deep technical investigations focusing on credential expirations, service account failures, authentication incidents, and cascading failures
Document findings in detailed RCA reports with actionable remediation steps
Build and refine monitoring dashboards to improve Mean Time to Detect (MTTD) and Mean Time to Mitigate (MTTM)

AI-Driven Operations & Automation:
Contribute to AI-driven incident detection and automated response initiatives, building autonomous monitoring and remediation capabilities
Develop scripts and automation to remediate common incidents, reduce manual toil, and accelerate response workflows
Create automated health checks and build integrations between monitoring platforms (New Relic, PagerDuty, ServiceNow, Jira)

Observability & Reliability:
Enhance monitoring, logging, and alerting using New Relic or similar platforms
Track operational metrics (MTTD, MTTM, incident trends) and build executive-level dashboards
Support SLO/SLI definition and tracking for critical services and network infrastructure
Collaborate with teams to improve fault tolerance, redundancy, and disaster recovery

Collaboration & Leadership:
Work closely with software engineering, infrastructure, and network teams to improve operational readiness
Communicate effectively with stakeholders at all levels during incidents and post-incident reviews
Contribute to NOC optimization, including shift scheduling and process improvements
Job Type
Full-time
Career Level
Entry Level