Transient outage on Container Apps hosted YARP reverse proxy – suspected SNAT exhaustion
We experienced a 13-minute availability outage on Sunday morning (03/05/2026, 04:44–04:57 UTC) affecting a YARP reverse proxy hosted on an Azure Container App and routing to an App Service backend, and I'm looking for help confirming the suspected root cause.
ARCHITECTURE OVERVIEW
We have a YARP reverse proxy running on an Azure Container App (West Europe) within a VNet-integrated Container App Environment. YARP routes inbound requests to several Azure App Service backends via their public azurewebsites.net endpoints. No NAT gateway is configured on the Container App environment's subnet, so outbound connections rely on Azure's default ephemeral SNAT.
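For concreteness, the proxy wiring is roughly equivalent to the minimal sketch below. The route name, cluster name, and hostname are placeholders and the in-memory config style is just for illustration; the point is that every proxied call goes to a public azurewebsites.net destination and therefore leaves the environment over SNAT.

```csharp
using Yarp.ReverseProxy.Configuration;

var builder = WebApplication.CreateBuilder(args);

// One catch-all route per backend; the destination is the backend's public endpoint,
// so every proxied request is an outbound connection from the Container App environment.
builder.Services.AddReverseProxy().LoadFromMemory(
    routes: new[]
    {
        new RouteConfig
        {
            RouteId = "backend-route",      // illustrative name
            ClusterId = "backend-cluster",
            Match = new RouteMatch { Path = "{**catch-all}" }
        }
    },
    clusters: new[]
    {
        new ClusterConfig
        {
            ClusterId = "backend-cluster",
            Destinations = new Dictionary<string, DestinationConfig>
            {
                // placeholder hostname for the affected backend
                ["primary"] = new() { Address = "https://backend-app.azurewebsites.net/" }
            }
        }
    });

var app = builder.Build();
app.MapReverseProxy();
app.Run();
```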
SYMPTOMS
- All availability checks against custom domains routed through YARP to one specific backend app service failed with 30-second timeouts for approximately 13 minutes
- Availability checks targeting the same app service directly (bypassing YARP) passed throughout (see the probe sketch after this list)
- Availability checks for a second, lower-traffic backend routed through the same YARP instance were unaffected
- No errors or platform events were recorded on either the Container App or the app service during the window
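To make the comparison between the two paths concrete, the checks behave roughly like the local equivalent below. The hostnames are placeholders; during the incident only the path through the custom domain (and therefore through YARP) timed out.

```csharp
// Local equivalent of the availability checks (hostnames are placeholders).
using var client = new HttpClient { Timeout = TimeSpan.FromSeconds(30) };

foreach (var url in new[]
{
    "https://proxy.contoso.com/health",             // custom domain -> YARP -> backend
    "https://backend-app.azurewebsites.net/health"  // backend reached directly, bypassing YARP
})
{
    try
    {
        var response = await client.GetAsync(url);
        Console.WriteLine($"{url} -> {(int)response.StatusCode}");
    }
    catch (TaskCanceledException)
    {
        Console.WriteLine($"{url} -> timed out after 30 s");
    }
}
```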
INVESTIGATION FINDINGS
- YARP's outbound dependency telemetry recorded 625 Canceled and 15 Faulted calls to the affected backend during the incident window, with average durations of ~29–32 seconds
- The affected app service's own telemetry showed normal request volumes and response times throughout — it appeared healthy from its own perspective
- Azure's health check probes on the app service passed continuously
- No application errors, process recycling, or infrastructure events were found on either resource
- The lower-traffic backend sharing the same YARP instance was unaffected, which is consistent with SNAT port exhaustion being triggered by higher connection volume to the affected backend
- Connections appeared to be established but then hung indefinitely with no RST or error — characteristic of silent SNAT failure rather than a TLS or application issue
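One interim guard we're considering while the root cause is being confirmed is to stop proxied calls from sitting on a dead flow for the probe's full 30-second window. A minimal sketch, assuming YARP's ConfigureHttpClient hook and placeholder timeout values (not applied in production yet):

```csharp
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddReverseProxy()
    .LoadFromConfig(builder.Configuration.GetSection("ReverseProxy"))
    .ConfigureHttpClient((context, handler) =>
    {
        // Give up quickly on a stalled TCP/TLS handshake instead of waiting out the probe window.
        handler.ConnectTimeout = TimeSpan.FromSeconds(5);

        // Recycle pooled connections periodically so a silently dropped flow is not reused.
        handler.PooledConnectionLifetime = TimeSpan.FromMinutes(2);
    });

var app = builder.Build();
app.MapReverseProxy();
app.Run();
```

This would shorten the failure rather than prevent it, but it should turn 30-second hangs into fast, visible errors.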
SUSPECTED ROOT CAUSE
We believe this was caused by transient SNAT port exhaustion or a SNAT allocation failure on the Container App environment's outbound path. The pattern of silent connection hangs, self-recovery after roughly 13 minutes, no application-level errors, and impact confined to the higher-traffic upstream points to SNAT rather than an application or infrastructure fault.
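As a back-of-envelope sanity check on the "higher connection volume" reasoning (Little's law: concurrent flows ≈ arrival rate × hold time), with placeholder numbers, since we have not found a documented per-replica SNAT port budget for Container Apps:

```csharp
// Rough sanity check, not a measurement. All three numbers are assumptions/placeholders.
const double requestsPerSecond = 50;     // assumed peak rate to the affected backend
const double holdTimeSeconds = 30;       // observed hang duration during the incident
const int assumedSnatPortBudget = 1024;  // placeholder per-replica SNAT allocation

// Little's law: flows held open concurrently = arrival rate x time each flow is held.
double concurrentFlows = requestsPerSecond * holdTimeSeconds; // 1500 with these numbers

Console.WriteLine($"Estimated concurrent outbound flows: {concurrentFlows}");
Console.WriteLine(concurrentFlows > assumedSnatPortBudget
    ? "Exceeds the assumed SNAT port budget, so exhaustion is plausible."
    : "Fits within the assumed SNAT port budget.");
```

With keep-alive and connection pooling the steady-state demand should normally be far lower; the point is that once flows start hanging for ~30 seconds, port demand balloons quickly, which would also explain why only the busier backend was affected.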
QUESTIONS
- Is there any way to confirm or rule out SNAT exhaustion as the root cause from the Azure platform side? We are not aware of any metrics or logs that expose SNAT port usage for Container App Environments.
- Is a NAT gateway on the Container App environment subnet the recommended mitigation for SNAT exhaustion in this topology, or are there other approaches?
- Would routing YARP to backends via private endpoints (eliminating the public internet hop and SNAT dependency entirely) be the preferred long-term solution for this architecture?
Thanks in advance for any guidance.