S2D Fault Domain Behavior

Question

Cem Iscen 20

Does Storage Spaces Direct treat a node in Suspend/Maintenance mode differently from a hard power off when calculating fault domain loss and write quorum?

Answer 1

Yep - it does. A node placed into Suspend or Maintenance mode is a coordinated, cluster-aware state, whereas a hard power-off is treated as an unexpected failure. Storage Spaces Direct adjusts its behavior accordingly.

When you put a node into maintenance, the cluster intentionally drains roles and pauses I/O on that node. Storage Spaces Direct also tries to ensure data safety by redirecting writes and, depending on policy and health, may repair or rebalance data so that active copies remain on other nodes. The node is still part of cluster membership (paused), so the system has full knowledge of its state and can make controlled decisions about data placement and quorum. Because of that coordination, it avoids counting the node as a sudden fault domain loss in the same way as a crash, and it will block or warn if entering maintenance would violate resiliency guarantees.

With a hard power-off, the node disappears abruptly. The cluster immediately treats it as a fault domain loss. Storage Spaces Direct must assume any data on that node is unavailable and falls back to remaining replicas or parity. This directly impacts write quorum and resiliency calculations because the system now has fewer available copies, and it may restrict writes if the remaining redundancy is insufficient to meet the volume’s resiliency policy.

In practice, maintenance mode is designed to preserve write availability by ensuring enough healthy copies remain before and during the operation, while a power-off can transiently reduce the number of accessible copies and therefore more aggressively affect write quorum decisions.
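To make the difference concrete, here is a toy model in Python. It is not how Storage Spaces Direct is implemented, just a sketch under stated assumptions: each node is one fault domain, a paused node's drives stop serving I/O, a two-way mirror tolerates one lost fault domain, and a three-way mirror tolerates two. The names `Node`, `volume_stays_online` and `may_enter_maintenance` are made up for illustration.

```python
from dataclasses import dataclass

# Fault domain losses a volume can absorb and stay online
# (two-way mirror: 1, three-way mirror: 2).
TOLERATED_LOSSES = {"two-way mirror": 1, "three-way mirror": 2}


@dataclass
class Node:
    name: str
    state: str  # "up", "paused" (planned maintenance) or "down" (hard power-off)


def unavailable_fault_domains(nodes):
    """Count nodes whose drives cannot serve I/O, planned or not."""
    return sum(1 for n in nodes if n.state in ("paused", "down"))


def volume_stays_online(resiliency, nodes):
    """True while the number of lost fault domains is within tolerance."""
    return unavailable_fault_domains(nodes) <= TOLERATED_LOSSES[resiliency]


def may_enter_maintenance(resiliency, nodes, candidate):
    """Hypothetical pre-check: would pausing `candidate` keep the volume online?

    This is the part a hard power-off never gets: the check runs BEFORE
    the node goes away, so the request can be refused.
    """
    after = [Node(n.name, "paused") if n.name == candidate else n for n in nodes]
    return volume_stays_online(resiliency, after)


# Example: a three-node cluster where n2 has already crashed.
cluster = [Node("n1", "up"), Node("n2", "down"), Node("n3", "up")]
two_way_ok = may_enter_maintenance("two-way mirror", cluster, "n1")
three_way_ok = may_enter_maintenance("three-way mirror", cluster, "n1")
```

In this model both kinds of outage remove the same number of copies. The only difference is that the planned path can be checked and refused up front, while a crash is counted after the fact.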

hth

Marcin

Answer 2

kholienchieh 0 MVP

In some scenarios, especially after evicting a corrupted node, drives are left in a "Transient Error" state and none of the PowerShell commands restore their status.

The following script from GitHub helps.

https://github.com/MicrosoftDocs/windowsserverdocs/blob/main/WindowsServerDocs/storage/storage-spaces/media/troubleshooting/Clear-PhysicalDiskHealthData.txt
