
**Moodle site on Azure going down during activity duplication — disk I/O issue or something else?**

Culture Shift Team
2026-05-05T14:50:01.2133333+00:00

!!!CUSTOMER SUPPORT HAS NOT PROVIDED ANY EXPLANATION OR SUPPORT FROM THE RIGHT DEPARTMENT, NOW A FULL WEEK AFTER THE INCIDENT!!!

Hoping someone can help me make sense of some metrics and confirm whether our planned upgrade will actually fix the problem.

Our setup:

  • Azure VM: Standard B2als_v2 (2 vCPUs, 4GB RAM)
  • OS: Ubuntu 24.04
  • Disk: 64GB Premium SSD LRS, Max IOPS: 240
  • Moodle with H5P, SCORM, and standard activities
  • ~100 active users currently, planning to onboard 350-500 more by October

The problem:

Duplicating activities (even lightweight text-only ones) is extremely slow — over a minute in some cases. Last week, duplicating a single SCORM activity caused the entire site to become inaccessible for ~10 minutes.

What the Azure metrics showed during the outage:

  • CPU: peaked at only 31%
  • Disk Write Operations: spiked to 630+/sec against a Max IOPS of 240 (well over 2x the disk limit)
  • Disk Write throughput: peaked at 1.59 GiB
  • Disk Read throughput: peaked at 555 KiB
  • Network Out: peaked at 77 MiB — not sure if this is relevant or abnormal

Planned upgrade:

We're considering moving to a D4as_v4 (4 vCPUs, 16GB RAM, local temporary storage, higher IOPS) primarily because it addresses the disk I/O limitation. But we're not confident it can handle a 1.59 GiB write spike — is that level of throughput normal for Moodle activity duplication, or does it suggest something else is going on?

Specific questions:

  1. Is a 1.59 GiB disk write spike normal for duplicating a single Moodle activity, or does that suggest a deeper issue?
  2. Would the D4as_v4 be sufficient for our current load and planned growth to ~500 users, or should we be looking at something larger?
  3. Any idea what might be causing the 77 MiB Network Out spike during what should be a routine internal operation?
  4. Is there anything else we should be checking in the Azure metrics before upgrading the VM?
Azure Virtual Machines


1 answer

  1. Q&A Assist
    2026-05-05T14:50:25.85+00:00
    AI generated content. This answer was created with AI from Q&A Assist.

    The metrics indicate that disk performance limits are being exceeded and are the most likely cause of the outage. The behavior aligns with a disk IOPS/throughput bottleneck rather than CPU or RAM saturation.

    Addressing the specific questions:

    1. 1.59 GiB write spike during a single duplication
      Relative to the provisioned disk performance this is clearly abnormal, though the documentation does not define what is "normal" for Moodle specifically. What the metrics do establish:
    • Premium SSD disks have fixed IOPS and throughput limits; when the application issues more I/O than the disk can sustain, latency increases sharply and the VM can appear unresponsive.
    • The metrics Disk Write Bytes / Disk Write Bytes/Sec and Disk Write Operations/Sec measure total bytes and operations written from all disks attached to the VM over time, not application‑level CRUD operations.
    • A spike to 630+ write operations/sec on a disk with a 240 IOPS limit means the VM is issuing more than double the supported IOPS, so the storage system will queue requests (Disk Queue Depth grows) and latency increases (OS Disk Latency / Data Disk Latency).

    So the 1.59 GiB write spike indicates that the application and database activity during duplication are generating more disk writes than the current Premium SSD can handle. Whether this is due to Moodle configuration, database engine behavior, or plugins cannot be determined from the Azure metrics alone, but it is clearly a storage bottleneck.
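A quick way to sanity-check the spike is to convert the portal's aggregated byte counts into per-second rates and compare them against the disk's published limits. The sketch below assumes a 60-second metric time grain and that the 64 GiB Premium SSD is a P6 (240 IOPS, 50 MB/s); verify both assumptions against your own chart settings and disk SKU.

```python
# Sketch: compare the observed spike against assumed P6 disk limits.
# Assumptions (verify in your own environment):
#   - the "Disk Write Bytes" chart aggregates over 60-second intervals
#   - the 64 GiB Premium SSD is a P6: 240 IOPS, 50 MB/s throughput

INTERVAL_S = 60                # assumed portal time grain
P6_IOPS_LIMIT = 240
P6_THROUGHPUT_MBPS = 50        # MB/s

write_bytes = 1.59 * 1024**3   # 1.59 GiB aggregated over one interval
write_ops_per_s = 630          # observed Disk Write Operations/Sec peak

write_mb_per_s = write_bytes / INTERVAL_S / 1e6
iops_ratio = write_ops_per_s / P6_IOPS_LIMIT

print(f"sustained write throughput: {write_mb_per_s:.1f} MB/s "
      f"(limit {P6_THROUGHPUT_MBPS} MB/s)")
print(f"IOPS oversubscription: {iops_ratio:.1f}x the {P6_IOPS_LIMIT} IOPS cap")
```

Under these assumptions the sustained throughput (~28 MB/s) stays below the P6's 50 MB/s cap, which points at the roughly 2.6x IOPS oversubscription, not raw bandwidth, as the choke point.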

    2. Suitability of D4as_v4 for current and future load
      The documentation describes how VM and disk performance interact:
    • Azure exposes metrics such as VM Cached and Uncached IOPS Consumed Percentage, OS Disk IOPS Consumed Percentage, and Data Disk IOPS Consumed Percentage to show how close the VM and disks are to their limits.
    • When these metrics reach 100%, application performance is capped and slowness occurs until the load drops or capacity is increased.

    A move from a B‑series VM to a D‑series VM with higher IOPS and throughput limits (and more vCPUs/RAM) directly addresses the condition where VM Uncached IOPS Consumed Percentage or disk IOPS consumed percentage are at or near 100%. In the example from the documentation, when the VM’s uncached IOPS are at 100%, the recommended remediation is to move to a larger VM size that can handle the additional IO, after confirming that the attached disks are not the bottleneck.

    Given that the current disk is exceeding its IOPS limit by >2x during duplication, a D4as_v4 with higher IOPS capacity and appropriate Premium SSD sizing is a valid and recommended direction. Whether it is sufficient for 500 users depends on the actual IOPS and throughput limits of the chosen disk SKU and how close the *IOPS Consumed Percentage and *Bandwidth Consumed Percentage metrics run under peak load. Those metrics should be checked after the change; if they approach 100% again, a further scale‑up or additional disks may be required.
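For a rough sizing check, one can project the observed peak IOPS linearly from 100 to 500 users and compare it against published Premium SSD limits. This is a deliberately pessimistic model: duplication spikes are per-operation rather than per-user, so real scaling is usually sub-linear. The disk names and limits below are the standard Premium SSD tiers; the VM size's own uncached IOPS cap also applies and should be checked separately in the VM sizes documentation.

```python
# Back-of-envelope projection: if peak write IOPS scaled linearly with
# active users (a pessimistic assumption), what IOPS budget would
# ~500 users imply, and which Premium SSD tiers could absorb it?

current_users = 100
target_users = 500
observed_peak_iops = 630

projected_peak_iops = observed_peak_iops * target_users / current_users
print(f"projected peak: {projected_peak_iops:.0f} IOPS")

# Published Premium SSD IOPS limits per tier (verify against current docs):
disk_options = {"P6 (64 GiB)": 240, "P10 (128 GiB)": 500,
                "P15 (256 GiB)": 1100, "P20 (512 GiB)": 2300,
                "P30 (1 TiB)": 5000}
for name, iops in disk_options.items():
    verdict = "ok" if iops >= projected_peak_iops else "undersized"
    print(f"{name}: {iops} IOPS -> {verdict}")
```

Under this upper-bound model only a P30-class disk covers the projection; since duplication load does not really multiply with user count, treat the result as a ceiling and confirm with the *IOPS Consumed Percentage metrics after the change.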

    3. 77 MiB Network Out spike
      The disk metrics documentation does not directly correlate network spikes with disk operations, but some points are clear:
    • Disk metrics are at the VM/disk level and do not log application‑level CRUD operations.
    • High disk write activity can coincide with increased network activity if the application is serving large responses, streaming content, or if backup/sync processes are running.

    A 77 MiB Network Out spike during a period of heavy disk writes suggests that, while the duplication was running, the web server may also have been sending a significant amount of data to clients (for example, users loading SCORM packages or other content). However, the Azure metrics alone do not specify which process or endpoint generated that traffic.
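It also helps to convert the aggregated Network Out figure into a sustained rate, again assuming a 60-second metric interval (verify the chart's time grain):

```python
# Sketch: convert the aggregated Network Out figure into a sustained rate.
# Assumes a 60-second metric interval.

INTERVAL_S = 60
net_out_bytes = 77 * 1024**2   # 77 MiB over the interval

rate_mbit_per_s = net_out_bytes * 8 / INTERVAL_S / 1e6
print(f"sustained egress: {rate_mbit_per_s:.1f} Mbit/s")
```

At roughly 11 Mbit/s this is modest for an Azure VM NIC, which supports the reading that the network spike is a symptom (clients retrying or loading content during the stall) rather than a cause of the outage.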

    4. What else to check in Azure metrics before upgrading
      Before changing the VM size, the following metrics should be reviewed over the time window when slowness occurs:
    • OS Disk IOPS Consumed Percentage and OS Disk Bandwidth Consumed Percentage
    • Data Disk IOPS Consumed Percentage and Data Disk Bandwidth Consumed Percentage
    • VM Cached and Uncached IOPS Consumed Percentage
    • VM Cached and Uncached Bandwidth Consumed Percentage
    • OS Disk Latency and Data Disk Latency (if available for the controller type)
    • OS Disk Queue Depth and Data Disk Queue Depth

    If these metrics show:

    • VM uncached IOPS or bandwidth at or near 100% while disk IOPS consumed percentage is moderate, the VM size is the bottleneck and scaling the VM (as in the documented example) is appropriate.
    • Disk IOPS or bandwidth consumed percentage near 100% while VM limits are not saturated, then the disk SKU/size is the bottleneck and upgrading the disk (larger Premium SSD, multiple disks, or enabling on‑demand bursting where applicable) is required.
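The two triage rules above can be sketched as a small helper. The 95% threshold is an arbitrary illustrative stand-in for "at or near 100%", and the function is hypothetical, not an Azure API:

```python
# Sketch of the triage rules above. NEAR_LIMIT is an arbitrary
# illustrative threshold for "at or near 100%".

NEAR_LIMIT = 95.0  # percent

def storage_bottleneck(vm_uncached_iops_pct: float,
                       disk_iops_pct: float) -> str:
    """Classify whether the VM or the disk is the likely bottleneck."""
    vm_saturated = vm_uncached_iops_pct >= NEAR_LIMIT
    disk_saturated = disk_iops_pct >= NEAR_LIMIT
    if vm_saturated and not disk_saturated:
        return "scale up the VM size"
    if disk_saturated and not vm_saturated:
        return "upgrade the disk (larger SKU, more disks, or bursting)"
    if vm_saturated and disk_saturated:
        return "both VM and disk limits are saturated; address both"
    return "storage limits not saturated; look elsewhere (e.g. DB config)"

print(storage_bottleneck(40.0, 100.0))
```

In this case the reported 630 ops/s against a 240 IOPS disk corresponds to the second branch: the disk SKU, not the VM, is saturated.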

    Azure Copilot can also be used to analyze these metrics automatically by asking why the disk is slow, selecting the affected VM and disk, and specifying the time range of the incident. It will check the same metrics and indicate whether the VM or disk limits are being hit and recommend actions such as upgrading disks, enabling bursting, or adding disks.


