
Add disk failure test to validate SR and VM resilience #312


Draft: rushikeshjadhav wants to merge 2 commits into master from feat-storage-linstor-627

Conversation

rushikeshjadhav (Contributor)

Added test_linstor_sr_fail_disk, which (see the sketch below):

  • Simulates the failure of an LVM PV on a random host in the LINSTOR SR pool by offlining a selected disk.
  • Verifies VM start/shutdown on all hosts despite the degraded pool state.
  • Ensures that the SR and PBDs recover after a reboot of the affected host.

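Roughly, the flow looks like the sketch below. This is an illustration only: the fixture and helper names (linstor_sr, vm_on_linstor_sr, host.ssh, vm.start(on=...), sr.plug_pbds) are assumptions, not necessarily the exact API used in this PR.

```python
# Illustrative sketch only -- fixture and helper names are assumptions,
# not the PR's actual code.
import random


def test_linstor_sr_fail_disk(linstor_sr, vm_on_linstor_sr):
    sr = linstor_sr
    vm = vm_on_linstor_sr          # small VM whose VDI lives on the LINSTOR SR
    hosts = sr.pool.hosts

    # Pick a random host and offline the disk backing its LVM PV.
    # Losing one PV takes down the whole VG on that host.
    random_host = random.choice(hosts)
    disk = "sdb"                   # assumed name of the PV disk on that host
    random_host.ssh(f"echo offline > /sys/block/{disk}/device/state")

    try:
        # Despite the degraded pool, the VM must start and shut down
        # on every host, diskful or diskless.
        for host in hosts:
            vm.start(on=host.uuid)
            vm.wait_for_os_booted()
            vm.shutdown(verify=True)
    finally:
        # Reboot the affected host and verify that the SR and PBDs recover.
        random_host.reboot(verify=True)
        sr.plug_pbds(verify=True)
        sr.scan()
```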

Signed-off-by: Rushikesh Jadhav <[email protected]>
Member

Could this test break in a way that the teardown fails and leaves the pool in a bad state?

rushikeshjadhav (Contributor, Author)

Short answer - no, because if the test fails to mark a disk offline, the disk stays online and normal operations/teardown continue.

Long version -

In XOSTOR, each host has a dedicated LVM pool. Therefore, a disk failure within the pool effectively results in a failure of the entire Volume Group (VG) on that host — making it equivalent to a host-level failure from the storage perspective.

This test ensures that even if the VG on a host fails (while the host itself remains operational), the VM can still boot on any host, whether diskful or diskless. The goal is to confirm that a single disk or VG failure does not impact overall VM availability. (We could extend the test to cover VMs already running on the failing host.)

If the test fails due to issues such as the disk not properly going offline, we reboot the affected host. In most cases this brings the storage pool back online, and the overall teardown is not affected.
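For reference, here is a minimal sketch of the offline/online mechanism at the SCSI sysfs level (the exact commands and helper names are assumptions, not necessarily what the test uses). If the write to the `state` attribute fails, the device simply stays `running` and teardown proceeds normally.

```python
# Sketch of the disk offline/online mechanism via the SCSI sysfs 'state'
# attribute. The host.ssh helper is an assumption for illustration.

def offline_disk(host, disk):
    # Mark the SCSI device offline; from LVM's point of view the PV, and
    # therefore the whole VG on this host, is gone.
    host.ssh(f"echo offline > /sys/block/{disk}/device/state")
    state = host.ssh(f"cat /sys/block/{disk}/device/state").strip()
    assert state == "offline", f"disk {disk} did not go offline (state={state})"


def online_disk(host, disk):
    # Bring the device back; rebooting the host achieves the same result
    # and also re-plugs the PBDs.
    host.ssh(f"echo running > /sys/block/{disk}/device/state")
```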

An important caveat arises during failure conditions: if the random_host has the xcp-persistent-database open and its VG fails due to disk loss, operations like VDI creation will fail. These resume only once the xcp-persistent-database is reopened and functional on another healthy host, typically after rebooting the failed one. (We don't test additional VDI operations in this test.)

rushikeshjadhav (Contributor, Author)

Moving this to Draft as the caveat scenario needs extra work. Essentially, when the xcp-persistent-database is InUse on the host with the failing disk, the VM.start operation gets stuck. Since the Data and Metadata volumes underneath are not healthy, normal SR operations do not work.

If the VM.start, the InUse xcp-persistent-database, and the failing disk do not all land on the same host, the test works fine. Whether they coincide is random.

@Wescoeur @Nambrok Can you review this scenario and suggest whether this is a known LINSTOR issue, or whether a workaround can be applied to recover from the hung VM.start case? For now, I'll use multiprocessing.Process to recover (rough sketch below).
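A rough sketch of that multiprocessing.Process guard (the timeout value and helper names are assumptions): run VM.start in a child process and, if it hangs past the timeout, terminate it so the test can recover by bringing the failed device back online or rebooting the host.

```python
# Sketch of guarding a potentially hung VM.start with multiprocessing.
# Timeout and vm/host helpers are illustrative assumptions; the objects
# passed to the child process must be picklable.
import multiprocessing


def _start_vm(vm, host_uuid):
    vm.start(on=host_uuid)


def start_vm_with_timeout(vm, host, timeout=600):
    proc = multiprocessing.Process(target=_start_vm, args=(vm, host.uuid))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        # VM.start is stuck (e.g. xcp-persistent-database was InUse on the
        # host with the failing disk): kill the worker and let the caller
        # recover by onlining the failed device or rebooting the host.
        proc.terminate()
        proc.join()
        raise TimeoutError(f"VM.start did not complete within {timeout}s")
```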

@rushikeshjadhav rushikeshjadhav marked this pull request as draft May 24, 2025 07:51
…e`, are on failing disk-host, then VM.start may get stuck.

The state can be recovered by bringing the failed device back online; however, it means that the test failed.

Signed-off-by: Rushikesh Jadhav <[email protected]>
@rushikeshjadhav rushikeshjadhav force-pushed the feat-storage-linstor-627 branch from 33280e1 to 52d2a71 on May 26, 2025 17:15