Transient outage on Container Apps hosted YARP reverse proxy – suspected SNAT exhaustion
We experienced a 13-minute availability outage on Sunday morning (03/05/2026, 04:44–04:57 UTC) affecting a YARP reverse proxy hosted on an Azure Container App and routing to an App Service backend, and I'm looking for help confirming the suspected root cause.
ARCHITECTURE OVERVIEW
We have a YARP reverse proxy running on an Azure Container App (West Europe) within a VNet-integrated Container App Environment. YARP routes inbound requests to several Azure App Service backends via their public azurewebsites.net endpoints. No NAT gateway is configured on the Container App environment's subnet, so outbound connections rely on Azure's default ephemeral SNAT.
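For concreteness, the proxy wiring is roughly equivalent to the minimal sketch below. The route name, cluster name, and hostname are placeholders and the in-memory config style is just for illustration; the point is that every proxied call goes to a public azurewebsites.net destination and therefore leaves the environment over SNAT.

```csharp
using Yarp.ReverseProxy.Configuration;

var builder = WebApplication.CreateBuilder(args);

// One catch-all route per backend; the destination is the backend's public endpoint,
// so every proxied request is an outbound connection from the Container App environment.
builder.Services.AddReverseProxy().LoadFromMemory(
    routes: new[]
    {
        new RouteConfig
        {
            RouteId = "backend-route",      // illustrative name
            ClusterId = "backend-cluster",
            Match = new RouteMatch { Path = "{**catch-all}" }
        }
    },
    clusters: new[]
    {
        new ClusterConfig
        {
            ClusterId = "backend-cluster",
            Destinations = new Dictionary<string, DestinationConfig>
            {
                // placeholder hostname for the affected backend
                ["primary"] = new() { Address = "https://backend-app.azurewebsites.net/" }
            }
        }
    });

var app = builder.Build();
app.MapReverseProxy();
app.Run();
```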
SYMPTOMS
- All availability checks against custom domains routed through YARP to one specific backend app service failed with 30-second timeouts for approximately 13 minutes
- Availability checks targeting the same app service directly (bypassing YARP) passed throughout (see the probe sketch after this list)
- Availability checks for a second, lower-traffic backend routed through the same YARP instance were unaffected
- No errors or platform events were recorded on either the Container App or the app service during the window
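To make the comparison between the two paths concrete, the checks behave roughly like the local equivalent below. The hostnames are placeholders; during the incident only the path through the custom domain (and therefore through YARP) timed out.

```csharp
// Local equivalent of the availability checks (hostnames are placeholders).
using var client = new HttpClient { Timeout = TimeSpan.FromSeconds(30) };

foreach (var url in new[]
{
    "https://proxy.contoso.com/health",             // custom domain -> YARP -> backend
    "https://backend-app.azurewebsites.net/health"  // backend reached directly, bypassing YARP
})
{
    try
    {
        var response = await client.GetAsync(url);
        Console.WriteLine($"{url} -> {(int)response.StatusCode}");
    }
    catch (TaskCanceledException)
    {
        Console.WriteLine($"{url} -> timed out after 30 s");
    }
}
```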
INVESTIGATION FINDINGS
- YARP's outbound dependency telemetry recorded 625 Canceled and 15 Faulted calls to the affected backend during the incident window, with average durations of ~29–32 seconds
- The affected app service's own telemetry showed normal request volumes and response times throughout — it appeared healthy from its own perspective
- Azure's health check probes on the app service passed continuously
- No application errors, process recycling, or infrastructure events were found on either resource
- The lower-traffic backend sharing the same YARP instance was unaffected, which is consistent with SNAT port exhaustion being triggered by higher connection volume to the affected backend
- Connections appeared to be established but then hung indefinitely with no RST or error — characteristic of silent SNAT failure rather than a TLS or application issue
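One interim guard we're considering while the root cause is being confirmed is to stop proxied calls from sitting on a dead flow for the probe's full 30-second window. A minimal sketch, assuming YARP's ConfigureHttpClient hook and placeholder timeout values (not applied in production yet):

```csharp
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddReverseProxy()
    .LoadFromConfig(builder.Configuration.GetSection("ReverseProxy"))
    .ConfigureHttpClient((context, handler) =>
    {
        // Give up quickly on a stalled TCP/TLS handshake instead of waiting out the probe window.
        handler.ConnectTimeout = TimeSpan.FromSeconds(5);

        // Recycle pooled connections periodically so a silently dropped flow is not reused.
        handler.PooledConnectionLifetime = TimeSpan.FromMinutes(2);
    });

var app = builder.Build();
app.MapReverseProxy();
app.Run();
```

This would shorten the failure rather than prevent it, but it should turn 30-second hangs into fast, visible errors.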
SUSPECTED ROOT CAUSE
We believe this was caused by transient SNAT port exhaustion or a SNAT allocation failure on the Container App environment's outbound path. The pattern of silent connection hangs, self-recovery after roughly 13 minutes, no application-level errors, and impact confined to the higher-traffic upstream points to SNAT rather than an application or infrastructure fault.
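As a back-of-envelope sanity check on the "higher connection volume" reasoning (Little's law: concurrent flows ≈ arrival rate × hold time), with placeholder numbers, since we have not found a documented per-replica SNAT port budget for Container Apps:

```csharp
// Rough sanity check, not a measurement. All three numbers are assumptions/placeholders.
const double requestsPerSecond = 50;     // assumed peak rate to the affected backend
const double holdTimeSeconds = 30;       // observed hang duration during the incident
const int assumedSnatPortBudget = 1024;  // placeholder per-replica SNAT allocation

// Little's law: flows held open concurrently = arrival rate x time each flow is held.
double concurrentFlows = requestsPerSecond * holdTimeSeconds; // 1500 with these numbers

Console.WriteLine($"Estimated concurrent outbound flows: {concurrentFlows}");
Console.WriteLine(concurrentFlows > assumedSnatPortBudget
    ? "Exceeds the assumed SNAT port budget, so exhaustion is plausible."
    : "Fits within the assumed SNAT port budget.");
```

With keep-alive and connection pooling the steady-state demand should normally be far lower; the point is that once flows start hanging for ~30 seconds, port demand balloons quickly, which would also explain why only the busier backend was affected.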
QUESTIONS
- Is there any way to confirm or rule out SNAT exhaustion as the root cause from the Azure platform side? We are not aware of any metrics or logs that expose SNAT port usage for Container App Environments.
- Is a NAT gateway on the Container App environment subnet the recommended mitigation for SNAT exhaustion in this topology, or are there other approaches?
- Would routing YARP to backends via private endpoints (eliminating the public internet hop and SNAT dependency entirely) be the preferred long-term solution for this architecture?
Thanks in advance for any guidance.