The metrics indicate that disk performance limits are being exceeded and are the most likely cause of the outage. The behavior aligns with a disk IOPS/throughput bottleneck rather than CPU or RAM saturation.
Addressing the specific questions:
- 1.59 GiB write spike during a single duplication
From the available information, this is clearly abnormal relative to the provisioned disk performance, but the documentation does not define “normal” for Moodle specifically. What can be stated from the metrics model:
- Premium SSD disks have fixed IOPS and throughput limits; when the application issues more I/O than the disk can sustain, latency increases sharply and the VM can appear unresponsive.
- The metrics Disk Write Bytes, Disk Write Bytes/Sec, and Disk Write Operations/Sec measure total bytes and operations written to all disks attached to the VM over time, not application‑level CRUD operations.
- A spike to 630+ write operations/sec on a disk with a 240 IOPS limit means the VM is issuing more than double the supported IOPS, so the storage system will queue requests (Disk Queue Depth grows) and latency increases (OS Disk Latency / Data Disk Latency).
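To make the arithmetic behind the 630 vs. 240 IOPS comparison concrete, here is a minimal sketch in Python. The function names are hypothetical, and the queue model is a deliberate simplification: it assumes the disk completes exactly its provisioned IOPS every second and ignores bursting and host caching, so it only illustrates why the queue grows rather than predicting real queue depth.

```python
def iops_overage(observed_iops: float, provisioned_iops: float) -> float:
    """Return the observed load as a multiple of the provisioned IOPS limit."""
    if provisioned_iops <= 0:
        raise ValueError("provisioned_iops must be positive")
    return observed_iops / provisioned_iops


def backlog_after(observed_iops: float, provisioned_iops: float, seconds: float) -> float:
    """Estimate operations still queued after `seconds` of sustained overload.

    Assumption (not from Azure documentation): the disk serves exactly
    `provisioned_iops` per second, with no bursting and no caching.
    """
    excess_per_second = max(0.0, observed_iops - provisioned_iops)
    return excess_per_second * seconds


# Figures taken from the incident described above.
ratio = iops_overage(630, 240)
backlog = backlog_after(630, 240, seconds=10)
print(f"load is {ratio:.1f}x the limit; ~{backlog:.0f} ops queued after 10 s")
```

Under these assumptions, every second of the spike adds about 390 unserved operations to the queue, which is consistent with the rising Disk Queue Depth and latency described above.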
So the 1.59 GiB write spike indicates that the application and database activity during duplication are generating more disk writes than the current Premium SSD can handle. Whether this is due to Moodle configuration, database engine behavior, or plugins cannot be determined from the Azure metrics alone, but it is clearly a storage bottleneck.
- Suitability of D4as_v4 for current and future load
The documentation describes how VM and disk performance interact:
- Azure exposes metrics such as VM Cached and Uncached IOPS Consumed Percentage, OS Disk IOPS Consumed Percentage, and Data Disk IOPS Consumed Percentage to show how close the VM and disks are to their limits.
- When these metrics reach 100%, application performance is capped and slowness occurs until the load drops or capacity is increased.
A move from a B‑series VM to a D‑series VM with higher IOPS and throughput limits (and more vCPUs/RAM) directly addresses the condition where VM Uncached IOPS Consumed Percentage or a disk IOPS Consumed Percentage metric is at or near 100%. In the example from the documentation, when the VM’s uncached IOPS are at 100%, the recommended remediation is to move to a larger VM size that can handle the additional I/O, after confirming that the attached disks are not the bottleneck.
Given that the current disk is exceeding its IOPS limit by >2x during duplication, a D4as_v4 with higher IOPS capacity and appropriate Premium SSD sizing is a valid and recommended direction. Whether it is sufficient for 500 users depends on the actual IOPS and throughput limits of the chosen disk SKU and how close the *IOPS Consumed Percentage and *Bandwidth Consumed Percentage metrics run under peak load. Those metrics should be checked after the change; if they approach 100% again, a further scale‑up or additional disks may be required.
- 77 MiB Network Out spike
The disk metrics documentation does not directly correlate network spikes with disk operations, but some points are clear:
- Disk metrics are at the VM/disk level and do not log application‑level CRUD operations.
- High disk write activity can coincide with increased network activity if the application is serving large responses, streaming content, or if backup/sync processes are running.
A 77 MiB Network Out spike during a period of heavy disk writes suggests that, while the duplication was running, the web server may also have been sending a significant amount of data to clients (for example, users loading SCORM packages or other content). However, the Azure metrics alone do not specify which process or endpoint generated that traffic.
- What else to check in Azure metrics before upgrading
Before changing the VM size, the following metrics should be reviewed over the time window when slowness occurs:
- OS Disk IOPS Consumed Percentage and OS Disk Bandwidth Consumed Percentage
- Data Disk IOPS Consumed Percentage and Data Disk Bandwidth Consumed Percentage
- VM Cached and Uncached IOPS Consumed Percentage
- VM Cached and Uncached Bandwidth Consumed Percentage
- OS Disk Latency and Data Disk Latency (if available for the controller type)
- OS Disk Queue Depth and Data Disk Queue Depth
If these metrics show:
- VM uncached IOPS or bandwidth at or near 100% while disk IOPS consumed percentage is moderate, the VM size is the bottleneck and scaling the VM (as in the documented example) is appropriate.
- Disk IOPS or bandwidth consumed percentage near 100% while VM limits are not saturated, then the disk SKU/size is the bottleneck and upgrading the disk (larger Premium SSD, multiple disks, or enabling on‑demand bursting where applicable) is required.
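The decision rule in the two bullets above can be sketched as a small Python helper. This is an illustration only: the function name, the return labels, and the 95% threshold for "at or near 100%" are assumptions, not values defined by Azure. The inputs are the peak values of the consumed-percentage metrics over the incident window, read from the portal or exported by whatever means is convenient.

```python
def classify_bottleneck(vm_peak_pct: float, disk_peak_pct: float,
                        threshold: float = 95.0) -> str:
    """Classify which limit is being hit from peak consumed-percentage values.

    vm_peak_pct:   peak VM uncached IOPS (or bandwidth) consumed percentage
    disk_peak_pct: peak OS/data disk IOPS (or bandwidth) consumed percentage
    threshold:     assumed cut-off for "at or near 100%" (hypothetical value)
    """
    vm_saturated = vm_peak_pct >= threshold
    disk_saturated = disk_peak_pct >= threshold

    if vm_saturated and disk_saturated:
        return "both"      # resize the VM and upgrade the disk
    if vm_saturated:
        return "vm"        # scale the VM size
    if disk_saturated:
        return "disk"      # larger disk, more disks, or bursting
    return "neither"       # look elsewhere (application, database, network)


# Example: disk pinned at its limit while the VM still has headroom.
print(classify_bottleneck(vm_peak_pct=60.0, disk_peak_pct=100.0))  # prints "disk"
```

Running the same check separately for the IOPS metrics and the bandwidth metrics is useful, because a workload can saturate throughput without saturating IOPS, and the reverse.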
Azure Copilot can also be used to analyze these metrics automatically by asking why the disk is slow, selecting the affected VM and disk, and specifying the time range of the incident. It will check the same metrics and indicate whether the VM or disk limits are being hit and recommend actions such as upgrading disks, enabling bursting, or adding disks.