Systems Development Engineer, AWS Incident Response (AIR)

AWS Incident Response (AIR) ensures the high availability of Amazon Web Services by making customer-impacting events shorter and less frequent through incident detection, management, and automated mitigation. Our systems monitor AWS infrastructure in real time, automatically detect impairments, and orchestrate responses to minimize customer impact across regions and services.

As a Systems Development Engineer on the AIR team, you will lead the response to critical customer-impacting events: triaging impact, identifying root causes, coordinating mitigation actions with service teams, and driving resolution in real time. Not every event is solved by automation; you will use your technical judgment to assess situations, engage the right teams, and direct mitigation strategies when manual intervention is required. Insights from these events directly inform the automation and tooling you build, creating a continuous improvement loop in which each event makes the next one shorter or prevents it entirely.

This role offers a unique combination of systems development and real-time operational leadership, with direct impact on the availability of AWS services used by millions of customers.

Key job responsibilities
• Drive the resolution of large-scale customer-impacting incidents as part of an on-call rotation (including weekends and holidays), leading incident calls and coordinating resolver teams across AWS service organizations
• Design, build, and enhance incident detection, triage, and mitigation automation tools
• Author Correction of Error (COE) and event deep-dive documents to identify improvement opportunities; create and lead action items that improve processes, tooling, and automation
• Identify recurring platform issues and own projects that eliminate entire classes of operational problems
• Collaborate with teams globally to expand incident response capabilities across AWS regions and services

A day in the life
A Systems Development Engineer on the AWS Incident Response (AIR) team has full visibility into all AWS services. Because you will work with internal teams across AWS and gain exposure to the full range of AWS products and services, the opportunities to learn are extensive.

When on-call, your day may start with a large-scale event: you join the conference bridge, assess the scope of impact using real-time dashboards, identify impaired services, engage the right teams, and drive mitigation until the event is resolved. After the event, you lead the deep-dive, document findings, and create action items to prevent recurrence.

When off-call, you spend your time building and improving the tools that make incident response faster and more automated. You might be writing code to improve event detection logic, building dashboards that surface the right signals during triage, or working on automation that reduces manual steps during mitigation. You participate in design and code reviews and collaborate with engineers across AIR to drive operational improvements. You also invest time in learning AWS service architectures, because understanding how services fail helps you respond faster when they do.

About the team
AWS Incident Response (AIR) is a globally distributed team responsible for leading the response to large-scale customer-impacting events across AWS. We operate 24/7, providing incident leadership and coordination for events that span multiple services and regions. Our engineers combine hands-on incident leadership with systems development: we build the automation and tooling we use, and every event teaches us how to make the next one shorter or prevent it entirely. The team values operational excellence, continuous learning, and a bias for action. We work closely with service teams, networking, and infrastructure organizations across AWS, giving our engineers broad exposure to how AWS operates under the hood.

Job Description

About Amazon