Reliability in Azure Traffic Manager

Azure Traffic Manager is a DNS-based traffic load balancer that distributes traffic optimally across globally distributed backends. Traffic Manager provides high availability and quick responsiveness for your public-facing applications by using DNS to direct client requests to appropriate service endpoints based on traffic-routing methods and endpoint health monitoring.

When you use Azure, reliability is a shared responsibility. Microsoft provides a range of capabilities to support resiliency and recovery. You're responsible for understanding how those capabilities work within all of the services you use, and selecting the capabilities you need to meet your business objectives and uptime goals.

This article describes the reliability capabilities of Azure Traffic Manager in response to a range of potential outages, including transient faults and region-wide failures. It also highlights key considerations for maintaining resilience and preparing for recovery, and provides an overview of the Azure Traffic Manager service-level agreement (SLA).

Note

This article describes how the Traffic Manager service is resilient, or how you can make it resilient, to various problems. It doesn't explain how to use Traffic Manager to perform failover between applications or regions. For an example failover architecture, see Multitier web application built for high availability and disaster recovery.

Production deployment recommendations

The Azure Well-Architected Framework provides recommendations for reliability, performance, security, cost, and operations. To learn how these areas influence each other and contribute to a reliable Traffic Manager solution, see Architecture best practices for Azure Traffic Manager in the Well-Architected Framework.

Reliability architecture overview

This section describes the aspects of how the service works that matter most for reliability. It introduces the logical architecture, which includes some of the resources and features that you deploy and use. It also discusses the physical architecture, which explains how the service works behind the scenes.

Logical architecture

When you use Traffic Manager, you deploy a profile, which specifies your application's back-end endpoints and configures how Traffic Manager should route requests to those endpoints. For more information, see Traffic Manager endpoints and Traffic Manager routing methods.

A Traffic Manager profile presents as a DNS CNAME record. When it receives a resolution request from a client or DNS resolver, Traffic Manager dynamically resolves the IP address based on rules you specify in the profile. Traffic Manager's responsibility is to provide clients with the IP address of an endpoint to reach your service. After name resolution, none of your application's traffic flows through Traffic Manager. For more information, see How Traffic Manager Works.

Traffic Manager monitors the health of your endpoints, and routes incoming requests to healthy endpoints while avoiding unhealthy endpoints. For more information, see Traffic Manager endpoint monitoring.

Important

The reliability of your overall solution depends on the configuration of the endpoints that your Traffic Manager profile routes traffic to.

This article doesn't cover your endpoints, but their availability configurations directly affect your application's resilience. Review the reliability guides for Azure services in your solution to learn how each service supports your reliability requirements.

Physical architecture

Traffic Manager operates as a nonregional service and deploys its infrastructure across multiple availability zones in multiple Azure regions worldwide. This design enables Traffic Manager to remain resilient during an availability zone or region outage, because infrastructure in another zone or region continues to respond to resolution requests.

Global internet protocols like Anycast, DNS, and BGP automatically route incoming DNS resolution requests to the nearest healthy Traffic Manager infrastructure.

Resilience to transient faults

Transient faults are short, intermittent failures in components. They occur frequently in a distributed environment like the cloud, and they're a normal part of operations. Transient faults correct themselves after a short period of time. It's important that your applications can handle transient faults, usually by retrying affected requests.

All cloud-hosted applications should follow the Azure transient fault handling guidance when they communicate with any cloud-hosted APIs, databases, and other components. For more information, see Recommendations for handling transient faults.

Traffic Manager operates at the DNS level and uses health probes to monitor endpoint availability. The service handles transient faults through its global DNS infrastructure and endpoint monitoring capabilities.

When you use Traffic Manager, consider the following types of transient faults separately:

Transient faults during DNS resolution: If a transient fault occurs during DNS resolution, the client or intermediate resolver should retry.
Transient faults affecting your back-end endpoints: Traffic Manager endpoint monitoring checks the health of your endpoints regularly. A transient fault within an endpoint, or in the network path to an endpoint, might be detected as an unhealthy endpoint. To avoid failing over on a single missed probe, configure endpoint monitoring (the probing interval, probe timeout, and tolerated number of failures) so that an endpoint is marked unhealthy only after several consecutive failed probes.
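
The first case above, retrying a failed DNS resolution, can be sketched in client code as follows. This is a minimal illustration rather than a Traffic Manager API: the helper names (resolve_ipv4, resolve_with_retry), the attempt count, and the backoff delays are all assumptions chosen for the example. Most clients rely on the operating system's resolver, which already retries, so an explicit loop like this is only relevant where your application performs its own lookups.

```python
import socket
import time
from typing import Callable, List


def resolve_ipv4(hostname: str) -> List[str]:
    """Resolve a hostname to its IPv4 addresses by using the system resolver."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def resolve_with_retry(
    hostname: str,
    resolver: Callable[[str], List[str]] = resolve_ipv4,
    attempts: int = 4,          # hypothetical value, tune for your workload
    base_delay: float = 0.5,    # hypothetical value, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Retry transient resolution failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return resolver(hostname)
        except OSError:
            # socket.gaierror (resolution failure) is a subclass of OSError.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

A call such as resolve_with_retry("contoso.trafficmanager.net") would use the real resolver; the host name here is a placeholder. The resolver and sleep parameters are injectable so that the retry behavior can be exercised without network access.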

Your DNS record's time to live (TTL) affects how your solution handles faults. If the TTL is very low, clients make more requests to Traffic Manager, which creates more opportunities for transient faults to arise. If the TTL is very high and an endpoint genuinely fails, clients might continue to use the failed endpoint until the cached record expires, which delays failover. Configure TTLs carefully to balance availability, latency, and responsiveness. When you use Azure DNS, it can automatically configure your record's TTL to match the profile's TTL value, which is 60 seconds by default. For more information, see Performance considerations for Traffic Manager.
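
To make the TTL trade-off concrete, the following sketch estimates an upper bound on how long a client might keep using a failed endpoint. It rests on a simplified model that is an assumption of this example, not a documented formula: an endpoint is marked unhealthy after (tolerated failures + 1) consecutive failed probes, the last probe runs to its full timeout, and a client then waits for one full TTL before it resolves again. The class and function names are hypothetical, and the monitoring defaults used here (30-second interval, 3 tolerated failures, 10-second timeout) should be checked against your own profile.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorSettings:
    """Endpoint monitoring settings (assumed defaults; verify against your profile)."""
    probe_interval_s: int = 30
    tolerated_failures: int = 3
    probe_timeout_s: int = 10


def detection_time_s(monitor: MonitorSettings) -> int:
    """Upper-bound estimate of the time to mark a failed endpoint unhealthy."""
    failed_probes_needed = monitor.tolerated_failures + 1
    return failed_probes_needed * monitor.probe_interval_s + monitor.probe_timeout_s


def failover_window_s(monitor: MonitorSettings, dns_ttl_s: int = 60) -> int:
    """Detection time plus one full TTL of client-side caching."""
    return detection_time_s(monitor) + dns_ttl_s


def queries_per_hour(dns_ttl_s: int) -> float:
    """Resolution requests per hour from one resolver that always re-queries on expiry."""
    return 3600 / dns_ttl_s
```

Under this model, lowering the TTL from 300 to 60 seconds shortens the estimated failover window by 240 seconds but raises the per-resolver query rate from 12 to 60 requests per hour, which illustrates the balance the paragraph above describes.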

Resilience to availability zone failures

Availability zones are physically separate groups of datacenters within an Azure region. When one zone fails, services can fail over to one of the remaining zones.

Traffic Manager operates as a nonregional service and deploys its infrastructure across multiple availability zones in multiple Azure regions worldwide. It replicates changes to your profile synchronously across these zones and regions. This design enables Traffic Manager to remain resilient during an availability zone outage, because infrastructure in another zone or region continues to respond to resolution requests.

Resilience to region-wide failures

Traffic Manager operates as a nonregional service and deploys its infrastructure across multiple availability zones in multiple Azure regions worldwide. This design enables Traffic Manager to remain resilient during a region outage, because infrastructure in another zone or region continues to respond to resolution requests.

Resilience to portal and management tool outages

If you manage your Traffic Manager profile in the Azure portal, prepare for scenarios where you can’t access it, especially if you need to reconfigure your profile during a platform outage.

Like other Azure services, Traffic Manager supports deployment and management through a variety of tools. We recommend you familiarize yourself with how to use Azure CLI or Azure PowerShell to manage your profile. Alternatively, deploy and configure your profile by using infrastructure as code technologies like Bicep or Terraform. These tools remain operational even if the Azure portal is degraded.

Backup and restore

Traffic Manager is a DNS service that stores only your profile configuration, not your application data, so it has no backup or restore capability.

To protect your resource configuration, define your Traffic Manager profiles and other resources using infrastructure as code (such as Bicep or ARM templates) and store those definitions in source control. If you need to recreate a resource, redeploy it from the stored configuration.
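
As one way to keep a profile definition in source control, the following sketch generates an ARM template as JSON from a short Python function. The profile name, DNS label, endpoint targets, and health path are hypothetical placeholders, and the property names and API version reflect the ARM schema for Microsoft.Network/trafficmanagerprofiles as understood at the time of writing; validate the output (for example, with a what-if deployment) before you rely on it.

```python
import json
from typing import Any, Dict, List


def traffic_manager_template(
    profile_name: str,
    dns_label: str,
    endpoint_targets: List[str],
    ttl_seconds: int = 60,
) -> Dict[str, Any]:
    """Build an ARM template with one priority-routed Traffic Manager profile."""
    endpoints = [
        {
            "name": f"endpoint-{priority}",
            "type": "Microsoft.Network/trafficManagerProfiles/externalEndpoints",
            "properties": {
                "target": target,
                "endpointStatus": "Enabled",
                "priority": priority,  # lower number = preferred endpoint
            },
        }
        for priority, target in enumerate(endpoint_targets, start=1)
    ]
    profile = {
        "type": "Microsoft.Network/trafficmanagerprofiles",
        "apiVersion": "2022-04-01",  # assumed version; confirm before deploying
        "name": profile_name,
        "location": "global",
        "properties": {
            "profileStatus": "Enabled",
            "trafficRoutingMethod": "Priority",
            "dnsConfig": {"relativeName": dns_label, "ttl": ttl_seconds},
            "monitorConfig": {"protocol": "HTTPS", "port": 443, "path": "/health"},
            "endpoints": endpoints,
        },
    }
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "resources": [profile],
    }


def render(template: Dict[str, Any]) -> str:
    """Serialize deterministically so that diffs in source control stay small."""
    return json.dumps(template, indent=2, sort_keys=True)
```

Writing the output of render(traffic_manager_template("tm-demo", "contoso-demo", ["app-east.example.com", "app-west.example.com"])) to a file and committing it gives you a definition that you can redeploy if the profile needs to be recreated.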

Resilience to service maintenance

Microsoft regularly applies service updates and performs other maintenance. The Azure platform handles these activities automatically, ensuring that maintenance is seamless and transparent to you. No downtime is expected during maintenance events unless you've been advised through Azure Service Health planned maintenance.

Service-level agreement

The service-level agreement (SLA) for Azure services describes the expected availability of each service and the conditions that your solution must meet to achieve that availability expectation. For more information, see SLAs for online services.

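The Azure Traffic Manager SLA targets 99.99% availability for valid responses to DNS queries. Individual queries can still fail transiently, so clients and resolvers should retry failed requests. For the exact terms and conditions, see the published SLA.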

Last updated on 2026-05-12