Tags: azure-devops, managed-devops-pools, azure-pipelines, vmss
Since approximately 2026-03-27, our Managed DevOps Pool agents have been randomly freezing mid-job. The agent stops producing log output entirely, ADO receives no further heartbeat, and the job stays in "Running" state until the pipeline timeout kills it. This affects both our MDP pools independently.
ENVIRONMENT
JDNextBuild-Set (primary concern):
- VM SKU: Standard_D4s_v3
- OS image: Ubuntu 24.04 (custom Azure Image Builder image)
- Agent version: 4.270.0 (recently also 4.271.0)
- Region: East US
- Network: VNet-injected, dedicated subnet
- Max concurrency: 15
Terraform-Set:
- VM SKU: Standard_D2s_v3
- OS image: Ubuntu 24.04 (custom Azure Image Builder image)
- Agent version: 4.270.0 (recently also 4.271.0)
- Region: East US
- Network: VNet-injected, dedicated subnet
- Max concurrency: 10
The pools use separate subnets, different VM SKUs, and completely different software stacks.
SYMPTOMS
- Random freeze point: the agent can freeze at any step. On JDNextBuild-Set we have seen freezes during Docker builds, helm upgrade, kubectl operations, and PowerShell steps expected to complete in under 2 seconds. There is no consistent step or pattern.
- No error output: the last log line is a normal, successful operation. No exception, exit code, or warning precedes the freeze.
- Agent heartbeat stops: ADO receives no further communication. The job sits in "Running" until the timeout fires.
- Agent diagnostic logs folder is empty: downloading the job logs and opening the "Agent Diagnostic Logs" folder shows no files whatsoever, which suggests the agent never reaches the point in its normal lifecycle where those logs get uploaded.
- MDPResourceLog shows no errors: checked in Log Analytics; no errors or warnings appear for either pool during the affected periods (see the query sketch after this list).
- MDP provisioning metrics are clean: no CustomScriptError or provisioning failures in the Metrics blade.
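For anyone wanting to reproduce the MDPResourceLog check: below is a minimal sketch of how we pull the table for a window around a frozen run. The service connection name and workspace variable are placeholders, the time window matches the Terraform-Set example in the next section, and the az monitor log-analytics query command may require the log-analytics CLI extension. (MDPResourceLog only exists if the pool's diagnostic settings route logs to the workspace.)

steps:
  - task: AzureCLI@2
    displayName: Dump MDPResourceLog around a freeze window
    inputs:
      azureSubscription: 'our-azure-connection'   # placeholder service connection
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        # Pull every MDPResourceLog row for the window around the frozen run;
        # in our case this comes back with zero Error/Warning entries.
        az monitor log-analytics query \
          --workspace "$(logAnalyticsWorkspaceId)" \
          --analytics-query "MDPResourceLog
            | where TimeGenerated between (datetime(2026-03-29T01:30:00Z) .. datetime(2026-03-29T03:00:00Z))
            | sort by TimeGenerated asc" \
          --output table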
EXAMPLES FROM RUN LOGS
From a Terraform-Set job:
2026-03-29T02:00:06Z Starting: Terraform Plan - bootstrap
...normal output...
[complete silence for 49 minutes 57 seconds]
2026-03-29T02:50:03Z Finishing: Finalize Job
2026-03-29T02:50:03Z The job running on agent Terraform-Set 4 stopped responding...
From a JDNextBuild-Set job:
2026-03-28T17:10:11Z Agent version: 4.270.0
...normal job output...
[freeze - no output, no error]
Agent Diagnostic Logs folder: empty (0 files)
WHAT WE HAVE RULED OUT
- Agent version change: 4.270.0 has been in use for months with no issue prior to 2026-03-27. No version change coincides with onset.
- Image-specific cause: the pools run completely different software stacks (JDNextBuild: Docker, Node.js, .NET, helm, kubectl; Terraform: Terraform, PowerShell Az modules). The same symptom appearing on both rules out anything image-related.
- Specific task: freezes occur at random points including trivial 2-second tasks. No single step is consistently involved.
- Burstable SKU throttling: JDNextBuild-Set is on non-burstable D4s_v3. Terraform-Set was moved from B2ms to D2s_v3; the issue predates and persists after this change.
- Custom DNS / networking: Azure-provided DNS, standard VNet with NSG. No custom DNS or unusual routing.
- No changes on our end: no image updates, pipeline changes, or infrastructure changes coincide with issue onset.
OBSERVATIONS
We have seen the MDP pool report an agent as Ready while the pipeline job it was supposed to be running had not yet timed out, and we have seen MDP deprovision that agent while the pipeline still reported the job as running.
We have also seen MDP fail to provision additional agents even though both our self-hosted parallel-job limit and the pool's maximum capacity allowed it. In that state, an agent sits in the pool as Ready but is never allocated a pending job until the frozen one times out.
SIMILAR REPORT FOUND
A very similar problem was reported for MS-hosted agents on 2026-03-24:
https://learn.microsoft.com/en-us/answers/questions/5835366/ms-hosted-build-agent-acquisition-is-stuck-(outage
Key similarities:
- Agent stuck with no meaningful error
- Agent Diagnostic Logs folder is empty - identical to our MDP case
- Azure status page showed all green throughout
- Self-resolved after ~24 hours with no action taken
- Accepted answer: "none of the advice here helped, it just randomly started working again after 24h"
Our issue differs in that it has persisted for multiple days across two independent pools rather than resolving on its own, and it is gradually getting worse by the day.
WORKAROUNDS APPLIED (these do not resolve the root cause)
- Reduced pipeline job timeouts (plan: 75 min, apply: 120 min) to free up concurrency slots faster after a freeze (see the YAML sketch after this list)
- Increased MDP pool concurrency to provide more headroom while frozen agents hold slots
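For anyone wanting to copy the timeout workaround, it is just the standard job-level timeoutInMinutes setting; a minimal sketch (job names and steps are placeholders for our actual pipeline):

jobs:
  - job: terraform_plan
    timeoutInMinutes: 75    # shortened so a frozen agent releases its concurrency slot sooner
    steps:
      - script: terraform plan -out=tfplan
        displayName: Terraform Plan - bootstrap
  - job: terraform_apply
    dependsOn: terraform_plan
    timeoutInMinutes: 120
    steps:
      - script: terraform apply tfplan
        displayName: Terraform Apply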
QUESTIONS
- Has anyone else experienced this pattern with MDP agents specifically (not MS-hosted)?
- Is there any way to get better diagnostics from inside the MDP VM when a freeze occurs? We cannot access the VMs directly, as they run in Microsoft's managed subscription. (A sketch of the best we have come up with so far is below.)
- Is there a known, still-ongoing platform issue with MDP in East US in this timeframe?
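On the diagnostics question, the best we have come up with from the pipeline side is the documented debug variables plus a detached watchdog that ships heartbeats off the VM while the job is nominally still running. This is a sketch of what we are considering, not something we have validated: the sink URL is a hypothetical endpoint we would stand up ourselves, and we have not confirmed any of this survives a freeze.

variables:
  system.debug: 'true'       # verbose task logging (documented Azure Pipelines variable)
  agent.diagnostic: 'true'   # extra agent-side diagnostic logs (documented Azure Pipelines variable)

steps:
  - bash: |
      # Detached loop: every 30s, append a timestamp + load average locally and
      # POST the same line to an external collector, since we cannot reach the
      # VM directly. nohup + & detaches it from the step, so it keeps running
      # even if the agent process itself hangs.
      nohup bash -c '
        while true; do
          line="$(date -u +%Y-%m-%dT%H:%M:%SZ) $(cat /proc/loadavg)"
          echo "$line" >> /tmp/freeze-watchdog.log
          curl -fsS -m 5 --data "$line" "$WATCHDOG_SINK_URL" || true
          sleep 30
        done
      ' >/dev/null 2>&1 &
    displayName: Start external freeze watchdog
    env:
      WATCHDOG_SINK_URL: https://example.invalid/mdp-heartbeat   # hypothetical endpoint

The obvious caveat is that the agent's process-tree cleanup would normally kill this loop at job completion, but during a freeze the job never completes, which is exactly when we want the heartbeats to keep flowing.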
Any insight appreciated.