Network Development Engineer, Office Network Reliability Engineering

Amazon • Austin, TX

About The Position

The Office Infrastructure Management (OIM) team within Amazon IT Services is looking for a Network Development Engineer (NDE) to join our newly established Office Network Reliability Engineering (ONRE) team. As an NDE on the ONRE team, you will be responsible for ensuring that 540K Amazonians across 400+ corporate offices experience highly available, reliable, and performant networks. You will operate at the intersection of expert incident resolution, systematic capability building, and proactive reliability engineering, ensuring that the office network infrastructure that underpins Amazonian productivity just works, every time.

This is not a traditional network operations role. You are a builder. You will design and develop automation systems, self-service tooling, and operational processes that scale Amazon's ability to detect, respond to, and prevent network incidents. You will serve as the Tier 3 escalation point for the Operations Management Center (OMC), which operates 24/7, resolving complex incidents that require deep technical expertise while building the OMC's capability to handle those incidents independently in the future. Your success is measured not only by your ability to resolve escalations, but also by your ability to systematically reduce them.

The ONRE team operates on a 24/7/365 follow-the-sun model across three regional hubs: EMEA, APAC, and AMER. You will participate in a rotating on-call schedule for high severity escalations and partner closely with the OMC, Office Infrastructure Excellence (OIE), AWS Enterprise Networking, and onsite IT support teams.

Requirements

4+ years of experience with major internet routing protocols
4+ years of experience with enterprise routing protocols including BGP, OSPF, MPLS, and their operational behavior in large corporate or cloud provider network environments
4+ years of experience operating and troubleshooting major network platforms and operating systems including Cisco IOS, IOS-XE, NX-OS, and/or Aruba AOS
4+ years of experience working independently and as part of large, distributed engineering teams across time zones
4+ years of industry experience in large-scale network environments including cloud provider, ISP, corporate enterprise, or large carrier networks
Demonstrated experience in 24/7 on-call operations for high severity incident response

Nice To Haves

Experience with Cisco ISE, Aruba ClearPass, or equivalent Network Access Control (NAC) platforms
Familiarity with IT Service Management platforms, specifically ServiceNow, including incident management workflows, TSG development, and CMDB
Experience building automation tooling, self-service platforms, or operational runbooks for use by operations teams with varying technical backgrounds
Track record of conducting post-incident reviews, root cause analysis, and lessons learned sessions with a focus on permanent defect elimination

Responsibilities

Serve as primary on-call for your regional hub on a rotating schedule, providing 24/7 Tier 3 escalation support to the Operations Management Center for complex office network incidents
Diagnose and resolve advanced failure scenarios including multi-site network outages, routing protocol failures, wireless infrastructure degradation affecting multiple access points, circuit performance problems requiring carrier coordination, and configuration drift causing intermittent customer-visible failures
Troubleshoot across all layers of the office network stack including LAN, WAN, and wireless (802.11), routing and switching (BGP, OSPF, VLANs, STP), network authentication (802.1X, RADIUS, ISE), and circuit infrastructure
Take end-to-end ownership of escalations, maintaining clear communication with the OMC throughout resolution to ensure uninterrupted visibility into customer-impacting issues, and acting as the subject matter expert (SME) between AWS Networking and the OMC
Create and maintain runbooks, diagnostic guides, and tribal knowledge documentation for complex failure scenarios, ensuring institutional knowledge is accessible and actionable
Conduct structured lessons learned sessions after every high severity (Sev 1/2) incident to systematically identify what prevented the OMC from resolving the incident independently, whether training gaps, permission limitations, technical barriers, or tooling deficiencies
Develop automation, self-service tools, and decision-tree troubleshooting guides that enable OMC engineers to independently handle incidents that previously required Tier 3 escalation
Deliver monthly knowledge transfer training sessions to OMC Tier 1 and Tier 2 engineers covering complex failure patterns, diagnostic techniques, and resolution approaches based on real escalation data
Track escalation patterns week-over-week through OMC operational reviews, using data to identify systemic issues and prioritize capability building investments
Build strong working partnerships with OMC engineers across all three regional hubs, earning trust through responsiveness, transparency, and consistent delivery
Execute Network Availability Risk (NAR) assessments to proactively identify and remediate technical debt, known software bugs, security vulnerabilities, and architectural risks before they cause customer-impacting incidents
Drive Operating System (OS) Compliance programs to maintain 95% of the office network fleet on production-certified operating system versions within 21 days of release, partnering with AWS Enterprise Networking on validation and rollout strategies
Implement Configuration Compliance programs to identify and eliminate configuration drift across the office network fleet, deploying optimized and consistent configurations that reduce failure rates
Participate in Network Infrastructure Validation (NIV) and Network OS Validation (NOV) reviews as a gatekeeper for new network designs, ensuring operability, monitoring readiness, runbook availability, and architectural soundness before production deployment
Contribute to 2026 engineering priorities including automation platform development, monitoring improvements (visibility and alarming), and iVPN and certificate lifecycle automation
Develop and integrate alarming (CI/CDK) into the Amazon ecosystem (pipeline) for new or existing platforms to improve observability in the office space

Benefits

health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage)
401(k) matching
paid time off
parental leave
