Network Development Engineer, Office Network Reliability Engineering
Company: Amazon
Location: Austin
Posted on: April 4, 2026
Job Description:
About the Role: The Office Infrastructure Management (OIM) team
within Amazon IT Services is looking for a Network Development
Engineer to join our newly established Office Network Reliability
Engineering (ONRE) team. As an NDE on the ONRE team, you will be
responsible for ensuring 540K Amazonians across 400 corporate
offices experience highly available, reliable, and performant
networks. You will operate at the intersection of expert incident
resolution, systematic capability building, and proactive
reliability engineering — ensuring that the office network
infrastructure that underpins Amazonian productivity just works,
every time. This is not a traditional network operations role. You
are a builder. You will design and develop automation systems,
self-service tooling, and operational processes that scale Amazon's
ability to detect, respond to, and prevent network incidents. You
will serve as the Tier 3 escalation point for the Operations
Management Center (OMC), which operates 24/7, resolving complex incidents
that require deep technical expertise while simultaneously building
the OMC's capability to handle those incidents independently in the
future. Your success is measured not only by your ability to
resolve escalations, but by your ability to systematically reduce
escalations. The ONRE team operates on a 24/7/365 follow-the-sun
model across three regional hubs: EMEA, APAC, and AMER. You will
participate in a rotating on-call schedule for high severity
escalations and partner closely with the OMC, Office Infrastructure
Excellence (OIE), AWS Enterprise Networking, and onsite IT support
teams.

Key job responsibilities:

Tier 3 Expert Escalation Support (approximately 40% of time):
- Serve as primary on-call for your regional hub on a rotating
  schedule, providing 24/7 Tier 3 escalation support to the
  Operations Management Center for complex office network incidents
- Diagnose and resolve advanced failure scenarios including
  multi-site network outages, routing protocol failures, wireless
  infrastructure degradation affecting multiple access points,
  circuit performance problems requiring carrier coordination, and
  configuration drift causing intermittent customer-visible failures
- Troubleshoot across all layers of the office network stack
  including wireless (LAN, WAN, 802.11), routing and switching (BGP,
  OSPF, VLANs, STP), network authentication (802.1X, RADIUS, ISE),
  and circuit infrastructure
- Take end-to-end ownership of escalations, maintaining clear
  communication with the OMC throughout resolution to ensure
  uninterrupted visibility into customer-impacting issues, and act
  as the SME between AWS Networking and the OMC
- Create and maintain runbooks, diagnostic guides, and tribal
  knowledge documentation for complex failure scenarios, ensuring
  institutional knowledge is accessible and actionable

OMC Capability Building and Knowledge Transfer (approximately 30% of
time):
- Conduct structured lessons learned sessions after every high
  severity (Sev 1/2) incident to systematically identify what
  prevented the OMC from resolving the incident independently,
  whether training gaps, permission limitations, technical barriers,
  or tooling deficiencies
- Develop automation, self-service tools, and decision-tree
  troubleshooting guides that enable OMC engineers to independently
  handle incidents that previously required Tier 3 escalation
- Deliver monthly knowledge transfer training sessions to OMC Tier 1
  and Tier 2 engineers covering complex failure patterns, diagnostic
  techniques, and resolution approaches based on real escalation data
- Track escalation patterns week-over-week through OMC operational
  reviews, using data to identify systemic issues and prioritize
  capability building investments
- Build strong working partnerships with OMC engineers across all
  three regional hubs, earning trust through responsiveness,
  transparency, and consistent delivery

Proactive Reliability Engineering (approximately 30% of time):
- Execute Network Availability Risk (NAR) assessments to proactively
  identify and remediate technical debt, known software bugs,
  security vulnerabilities, and architectural risks before they
  cause customer-impacting incidents
- Drive Operating System (OS) Compliance programs to maintain 95% of
  the office network fleet on production-certified operating system
  versions within 21 days of release, partnering with AWS Enterprise
  Networking on validation and rollout strategies
- Implement Configuration Compliance programs to identify and
  eliminate configuration drift across the office network fleet,
  deploying optimized and consistent configurations that reduce
  failure rates
- Participate in Network Infrastructure Validation (NIV) and Network
  OS Validation (NOV) reviews as a gatekeeper for new network
  designs, ensuring operability, monitoring readiness, runbook
  availability, and architectural soundness before production
  deployment
- Contribute to 2026 engineering priorities including automation
  platform development, monitoring improvements (visibility and
  alarming), and iVPN and certificate lifecycle automation
- Develop and integrate Alarming (CI/CDK) into the Amazon ecosystem
  (pipeline) for new or existing platforms to improve observability
  in the office space

A day in the life:
ONRE engineers operate at the
leading edge of where operational discipline meets engineering
innovation. On a given day, you might begin your shift by reviewing
overnight escalations from the team during handoff, identifying a
pattern in wireless controller failures that points to a
configuration gap rather than a hardware issue. You open a ticket,
document the root cause, and begin drafting an automated
remediation script that will allow OMC engineers to self-heal this
failure type going forward as the human in the loop. Later in the
day, you receive an escalation from the OMC Tier 2 team: a
multi-site WAN issue affecting three offices in your region with
Amazonians unable to access internal systems. You take the
escalation, engage the carrier, isolate the fault to a circuit
configuration issue introduced during a recent change, and restore
service within 40 minutes. You document the resolution and schedule
a lessons learned session for tomorrow to identify why the OMC did
not have the tooling or permissions to address this independently.
In the afternoon, you join a Network Infrastructure Validation
review for a new campus design and make recommendations on the
alerting coverage plan and pre-built runbooks before the design
moves to production. You close your shift by updating the team wiki
with your findings, handing off to the EMEA team, and reviewing
your open action items from recent lessons learned sessions. No two
days are the same. You will work in an environment where Amazon's
scale means there is no blueprint for every problem, and where your
ability to develop durable, scalable solutions has a direct and
visible impact on the productivity of hundreds of thousands of
Amazonians.

About the team:

What We Offer:
Joining the ONRE team means joining a team of builders who operate
at Amazon's scale with a clear and measurable mission: make the
office network invisible to the 540K Amazonians who depend on it
every day. You will:
- Build at scale: Develop automation, tooling, and processes that
  impact 76K network devices across 400 offices globally
- Own outcomes: Operate with the mindset that the buck stops here,
  with no escalation path beyond your team for office network
  reliability
- Grow as a builder: Apply your network expertise to develop
  innovative solutions that go beyond standard industry patterns
- Drive measurable impact: Your capability building work directly
  reduces escalation rates, improves OMC self-sufficiency, and
  delivers visible improvements to the Amazonian experience
- Work in a high-trust team: Operate with significant autonomy as
  part of a globally distributed team of experienced NDEs who value
  engineering excellence, intellectual curiosity, and partnership

Qualifications:
- 4 years of major internet routing protocols experience
- 4 years of experience with enterprise routing protocols including
  BGP, OSPF, and MPLS, and their operational behavior in large
  corporate or cloud provider network environments
- 4 years of experience operating and troubleshooting major network
  platforms and operating systems including Cisco IOS, IOS-XE,
  NX-OS, and/or Aruba AOS
- 4 years of experience working independently and as part of large,
  distributed engineering teams across time zones
- 4 years of industry experience in large-scale network environments
  including cloud provider, ISP, corporate enterprise, or large
  carrier networks
- Demonstrated experience in 24/7 on-call operations for high
  severity incident response
- Experience with Cisco ISE, Aruba ClearPass, or equivalent Network
  Access Control (NAC) platforms
- Familiarity with IT Service Management platforms, specifically
  ServiceNow, including incident management workflows, TSG
  development, and CMDB
- Experience building automation tooling, self-service platforms, or
  operational runbooks for use by operations teams with varying
  technical backgrounds
- Track record of conducting post-incident reviews, root cause
  analysis, and lessons learned sessions with a focus on permanent
  defect elimination

Amazon is an equal opportunity employer and does not
discriminate on the basis of protected veteran status, disability,
or other legally protected status. Our inclusive culture empowers
Amazonians to deliver the best results for our customers. If you
have a disability and need a workplace accommodation or adjustment
during the application and hiring process, including support for
the interview or onboarding process, please visit
https://amazon.jobs/content/en/how-we-hire/accommodations for more
information. If the country/region you’re applying in isn’t listed,
please contact your Recruiting Partner. The base salary range for
this position is listed below. Your Amazon package will include
sign-on payments and restricted stock units (RSUs). Final
compensation will be determined based on factors including
experience, qualifications, and location. Amazon also offers
comprehensive benefits including health insurance (medical, dental,
vision, prescription, Basic Life & AD&D insurance and option
for Supplemental life plans, EAP, Mental Health Support, Medical
Advice Line, Flexible Spending Accounts, Adoption and Surrogacy
Reimbursement coverage), 401(k) matching, paid time off, and
parental leave. Learn more about our benefits at
https://amazon.jobs/en/benefits.

USA, TX, Austin - 136,000.00 - 184,000.00 USD annually
Keywords: Amazon, Pflugerville, Network Development Engineer, Office Network Reliability Engineering, Engineering, Austin, Texas