
**Moodle site on Azure going down during activity duplication — disk I/O issue or something else?**

Culture Shift Team
2026-05-05T14:50:01.2133333+00:00

!!!CUSTOMER SUPPORT HAS NOT PROVIDED ANY EXPLANATION OR SUPPORT FROM THE RIGHT DEPARTMENT, NOW A FULL WEEK AFTER THE INCIDENT!!!

Hoping someone can help me make sense of some metrics and confirm whether our planned upgrade will actually fix the problem.

Our setup:

  • Azure VM: Standard B2als_v2 (2 vCPUs, 4GB RAM)
  • OS: Ubuntu 24.04
  • Disk: 64GB Premium SSD LRS, Max IOPS: 240
  • Moodle with H5P, SCORM, and standard activities
  • ~100 active users currently, planning to onboard 350-500 more by October

The problem:

Duplicating activities (even lightweight text-only ones) is extremely slow — over a minute in some cases. Last week, duplicating a single SCORM activity caused the entire site to become inaccessible for ~10 minutes.

What the Azure metrics showed during the outage:

  • CPU: peaked at only 31%
  • Disk Write Operations: spiked to 630+/sec against a Max IOPS of 240 (well over 2x the disk limit)
  • Disk Write throughput: peaked at 1.59 GiB
  • Disk Read throughput: peaked at 555 KiB
  • Network Out: peaked at 77 MiB — not sure if this is relevant or abnormal

Planned upgrade:

We're considering moving to a D4as_v4 (4 vCPUs, 16GB RAM, local temporary storage, higher IOPS) primarily because it addresses the disk I/O limitation. But we're not confident it can handle a 1.59 GiB write spike — is that level of throughput normal for Moodle activity duplication, or does it suggest something else is going on?

Specific questions:

  1. Is a 1.59 GiB disk write spike normal for duplicating a single Moodle activity, or does that suggest a deeper issue?
  2. Would the D4as_v4 be sufficient for our current load and planned growth to ~500 users, or should we be looking at something larger?
  3. Any idea what might be causing the 77 MiB Network Out spike during what should be a routine internal operation?
  4. Is there anything else we should be checking in the Azure metrics before upgrading the VM?
Azure Virtual Machines


1 answer

  1. Q&A Assist
    2026-05-05T14:50:25.85+00:00
    AI generated content. This answer was created with AI from Q&A Assist.

    The metrics indicate that disk performance limits are being exceeded and are the most likely cause of the outage. The behavior aligns with a disk IOPS/throughput bottleneck rather than CPU or RAM saturation.

    Addressing the specific questions:

    1. 1.59 GiB write spike during a single duplication
      Relative to the provisioned disk performance this is clearly abnormal, though the documentation does not define what is "normal" for Moodle specifically. What the metrics do establish:
    • Premium SSD disks have fixed IOPS and throughput limits; when the application issues more I/O than the disk can sustain, latency increases sharply and the VM can appear unresponsive.
    • The metrics Disk Write Bytes / Disk Write Bytes/Sec and Disk Write Operations/Sec measure total bytes and operations written from all disks attached to the VM over time, not application‑level CRUD operations.
    • A spike to 630+ write operations/sec on a disk with a 240 IOPS limit means the VM is issuing more than double the supported IOPS, so the storage system will queue requests (Disk Queue Depth grows) and latency increases (OS Disk Latency / Data Disk Latency).

    So the 1.59 GiB write spike indicates that the application and database activity during duplication are generating more disk writes than the current Premium SSD can handle. Whether this is due to Moodle configuration, database engine behavior, or plugins cannot be determined from the Azure metrics alone, but it is clearly a storage bottleneck.
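A quick way to sanity-check the spike is to convert the portal's aggregated byte counts into per-second rates and compare them against the disk's published limits. The sketch below assumes a 60-second metric time grain and that the 64 GiB Premium SSD is a P6 (240 IOPS, 50 MB/s); verify both assumptions against your own chart settings and disk SKU.

```python
# Sketch: compare the observed spike against assumed P6 disk limits.
# Assumptions (verify in your own environment):
#   - the "Disk Write Bytes" chart aggregates over 60-second intervals
#   - the 64 GiB Premium SSD is a P6: 240 IOPS, 50 MB/s throughput

INTERVAL_S = 60                # assumed portal time grain
P6_IOPS_LIMIT = 240
P6_THROUGHPUT_MBPS = 50        # MB/s

write_bytes = 1.59 * 1024**3   # 1.59 GiB aggregated over one interval
write_ops_per_s = 630          # observed Disk Write Operations/Sec peak

write_mb_per_s = write_bytes / INTERVAL_S / 1e6
iops_ratio = write_ops_per_s / P6_IOPS_LIMIT

print(f"sustained write throughput: {write_mb_per_s:.1f} MB/s "
      f"(limit {P6_THROUGHPUT_MBPS} MB/s)")
print(f"IOPS oversubscription: {iops_ratio:.1f}x the {P6_IOPS_LIMIT} IOPS cap")
```

Under these assumptions the sustained throughput (~28 MB/s) stays below the P6's 50 MB/s cap, which points at the roughly 2.6x IOPS oversubscription, not raw bandwidth, as the choke point.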

    2. Suitability of D4as_v4 for current and future load
      The documentation describes how VM and disk performance interact:
    • Azure exposes metrics such as VM Cached and Uncached IOPS Consumed Percentage, OS Disk IOPS Consumed Percentage, and Data Disk IOPS Consumed Percentage to show how close the VM and disks are to their limits.
    • When these metrics reach 100%, application performance is capped and slowness occurs until the load drops or capacity is increased.

    A move from a B‑series VM to a D‑series VM with higher IOPS and throughput limits (and more vCPUs/RAM) directly addresses the condition where VM Uncached IOPS Consumed Percentage or disk IOPS consumed percentage are at or near 100%. In the example from the documentation, when the VM’s uncached IOPS are at 100%, the recommended remediation is to move to a larger VM size that can handle the additional IO, after confirming that the attached disks are not the bottleneck.

    Given that the current disk is exceeding its IOPS limit by >2x during duplication, a D4as_v4 with higher IOPS capacity and appropriate Premium SSD sizing is a valid and recommended direction. Whether it is sufficient for 500 users depends on the actual IOPS and throughput limits of the chosen disk SKU and how close the *IOPS Consumed Percentage and *Bandwidth Consumed Percentage metrics run under peak load. Those metrics should be checked after the change; if they approach 100% again, a further scale‑up or additional disks may be required.
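For a rough sizing check, one can project the observed peak IOPS linearly from 100 to 500 users and compare it against published Premium SSD limits. This is a deliberately pessimistic model: duplication spikes are per-operation rather than per-user, so real scaling is usually sub-linear. The disk names and limits below are the standard Premium SSD tiers; the VM size's own uncached IOPS cap also applies and should be checked separately in the VM sizes documentation.

```python
# Back-of-envelope projection: if peak write IOPS scaled linearly with
# active users (a pessimistic assumption), what IOPS budget would
# ~500 users imply, and which Premium SSD tiers could absorb it?

current_users = 100
target_users = 500
observed_peak_iops = 630

projected_peak_iops = observed_peak_iops * target_users / current_users
print(f"projected peak: {projected_peak_iops:.0f} IOPS")

# Published Premium SSD IOPS limits per tier (verify against current docs):
disk_options = {"P6 (64 GiB)": 240, "P10 (128 GiB)": 500,
                "P15 (256 GiB)": 1100, "P20 (512 GiB)": 2300,
                "P30 (1 TiB)": 5000}
for name, iops in disk_options.items():
    verdict = "ok" if iops >= projected_peak_iops else "undersized"
    print(f"{name}: {iops} IOPS -> {verdict}")
```

Under this upper-bound model only a P30-class disk covers the projection; since duplication load does not really multiply with user count, treat the result as a ceiling and confirm with the *IOPS Consumed Percentage metrics after the change.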

    3. 77 MiB Network Out spike
      The disk metrics documentation does not directly correlate network spikes with disk operations, but some points are clear:
    • Disk metrics are at the VM/disk level and do not log application‑level CRUD operations.
    • High disk write activity can coincide with increased network activity if the application is serving large responses, streaming content, or if backup/sync processes are running.

    A 77 MiB Network Out spike during a period of heavy disk writes suggests that, while the duplication was running, the web server may also have been sending a significant amount of data to clients (for example, users loading SCORM packages or other content). However, the Azure metrics alone do not specify which process or endpoint generated that traffic.
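It also helps to convert the aggregated Network Out figure into a sustained rate, again assuming a 60-second metric interval (verify the chart's time grain):

```python
# Sketch: convert the aggregated Network Out figure into a sustained rate.
# Assumes a 60-second metric interval.

INTERVAL_S = 60
net_out_bytes = 77 * 1024**2   # 77 MiB over the interval

rate_mbit_per_s = net_out_bytes * 8 / INTERVAL_S / 1e6
print(f"sustained egress: {rate_mbit_per_s:.1f} Mbit/s")
```

At roughly 11 Mbit/s this is modest for an Azure VM NIC, which supports the reading that the network spike is a symptom (clients retrying or loading content during the stall) rather than a cause of the outage.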

    4. What else to check in Azure metrics before upgrading
      Before changing the VM size, the following metrics should be reviewed over the time window when slowness occurs:
    • OS Disk IOPS Consumed Percentage and OS Disk Bandwidth Consumed Percentage
    • Data Disk IOPS Consumed Percentage and Data Disk Bandwidth Consumed Percentage
    • VM Cached and Uncached IOPS Consumed Percentage
    • VM Cached and Uncached Bandwidth Consumed Percentage
    • OS Disk Latency and Data Disk Latency (if available for the controller type)
    • OS Disk Queue Depth and Data Disk Queue Depth

    If these metrics show:

    • VM uncached IOPS or bandwidth at or near 100% while disk IOPS consumed percentage is moderate, the VM size is the bottleneck and scaling the VM (as in the documented example) is appropriate.
    • Disk IOPS or bandwidth consumed percentage near 100% while VM limits are not saturated, then the disk SKU/size is the bottleneck and upgrading the disk (larger Premium SSD, multiple disks, or enabling on‑demand bursting where applicable) is required.
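The two triage rules above can be sketched as a small helper. The 95% threshold is an arbitrary illustrative stand-in for "at or near 100%", and the function is hypothetical, not an Azure API:

```python
# Sketch of the triage rules above. NEAR_LIMIT is an arbitrary
# illustrative threshold for "at or near 100%".

NEAR_LIMIT = 95.0  # percent

def storage_bottleneck(vm_uncached_iops_pct: float,
                       disk_iops_pct: float) -> str:
    """Classify whether the VM or the disk is the likely bottleneck."""
    vm_saturated = vm_uncached_iops_pct >= NEAR_LIMIT
    disk_saturated = disk_iops_pct >= NEAR_LIMIT
    if vm_saturated and not disk_saturated:
        return "scale up the VM size"
    if disk_saturated and not vm_saturated:
        return "upgrade the disk (larger SKU, more disks, or bursting)"
    if vm_saturated and disk_saturated:
        return "both VM and disk limits are saturated; address both"
    return "storage limits not saturated; look elsewhere (e.g. DB config)"

print(storage_bottleneck(40.0, 100.0))
```

In this case the reported 630 ops/s against a 240 IOPS disk corresponds to the second branch: the disk SKU, not the VM, is saturated.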

    Azure Copilot can also be used to analyze these metrics automatically by asking why the disk is slow, selecting the affected VM and disk, and specifying the time range of the incident. It will check the same metrics and indicate whether the VM or disk limits are being hit and recommend actions such as upgrading disks, enabling bursting, or adding disks.


