
Add disk failure test to validate SR and VM resilience #312


Draft: rushikeshjadhav wants to merge 2 commits into master from feat-storage-linstor-627

Conversation

rushikeshjadhav (Contributor)

Added test_linstor_sr_fail_disk, which (see the sketch below):

  • Simulates the failure of an LVM PV on a random host in the LINSTOR SR pool by offlining a selected disk.
  • Verifies VM start/shutdown on all hosts despite the degraded pool state.
  • Ensures that the SR and PBDs recover after a reboot of the affected host.

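Roughly, the flow looks like the sketch below. This is an illustration only: the fixture and helper names (linstor_sr, vm_on_linstor_sr, host.ssh, vm.start(on=...), sr.plug_pbds) are assumptions, not necessarily the exact API used in this PR.

```python
# Illustrative sketch only -- fixture and helper names are assumptions,
# not the PR's actual code.
import random


def test_linstor_sr_fail_disk(linstor_sr, vm_on_linstor_sr):
    sr = linstor_sr
    vm = vm_on_linstor_sr          # small VM whose VDI lives on the LINSTOR SR
    hosts = sr.pool.hosts

    # Pick a random host and offline the disk backing its LVM PV.
    # Losing one PV takes down the whole VG on that host.
    random_host = random.choice(hosts)
    disk = "sdb"                   # assumed name of the PV disk on that host
    random_host.ssh(f"echo offline > /sys/block/{disk}/device/state")

    try:
        # Despite the degraded pool, the VM must start and shut down
        # on every host, diskful or diskless.
        for host in hosts:
            vm.start(on=host.uuid)
            vm.wait_for_os_booted()
            vm.shutdown(verify=True)
    finally:
        # Reboot the affected host and verify that the SR and PBDs recover.
        random_host.reboot(verify=True)
        sr.plug_pbds(verify=True)
        sr.scan()
```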

Signed-off-by: Rushikesh Jadhav <[email protected]>
Member

Could this test break in a way that the teardown fails and leaves the pool in a bad state?

rushikeshjadhav (Contributor, Author)

Short answer - no, because if the test fails to mark a disk offline, the disk stays online and normal operations/teardown continue.

Long version -

In XOSTOR, each host has a dedicated LVM pool. Therefore, a disk failure within the pool effectively results in a failure of the entire Volume Group (VG) on that host — making it equivalent to a host-level failure from the storage perspective.

This test ensures that even if the VG on a host fails (while the host itself remains operational), the VM can still boot on any host, whether diskful or diskless. The goal is to confirm that a single disk or VG failure does not impact overall VM availability. (We could extend the test to cover VMs already running on the failing host.)

If the test fails due to issues such as the disk not properly going offline, we reboot the affected host. In most cases this brings the storage pool back online, and the overall teardown is not affected.
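For reference, here is a minimal sketch of the offline/online mechanism at the SCSI sysfs level (the exact commands and helper names are assumptions, not necessarily what the test uses). If the write to the `state` attribute fails, the device simply stays `running` and teardown proceeds normally.

```python
# Sketch of the disk offline/online mechanism via the SCSI sysfs 'state'
# attribute. The host.ssh helper is an assumption for illustration.

def offline_disk(host, disk):
    # Mark the SCSI device offline; from LVM's point of view the PV, and
    # therefore the whole VG on this host, is gone.
    host.ssh(f"echo offline > /sys/block/{disk}/device/state")
    state = host.ssh(f"cat /sys/block/{disk}/device/state").strip()
    assert state == "offline", f"disk {disk} did not go offline (state={state})"


def online_disk(host, disk):
    # Bring the device back; rebooting the host achieves the same result
    # and also re-plugs the PBDs.
    host.ssh(f"echo running > /sys/block/{disk}/device/state")
```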

An important caveat arises during failure conditions: if the random_host has the xcp-persistent-database open and its VG fails due to disk loss, operations like VDI creation will fail. These resume only once the xcp-persistent-database is reopened and functional on another healthy host, typically after rebooting the failed one. (We don't test additional VDI operations in this test.)

rushikeshjadhav (Contributor, Author)

Moving this to Draft as the caveat scenario needs extra work. Essentially, when the xcp-persistent-database is InUse on the host with the failing disk, the VM.start operation gets stuck. Since the Data and Metadata volumes underneath are not healthy, normal SR operations do not work.

If the VM.start, the InUse xcp-persistent-database, and the failing disk do not all land on the same host, the test works fine. Whether they coincide is random.

@Wescoeur @Nambrok Can you review this scenario and suggest whether this is a known LINSTOR issue, or whether a workaround can be applied to recover from the hung VM.start case? For now, I'll use multiprocessing.Process to recover (rough sketch below).
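A rough sketch of that multiprocessing.Process guard (the timeout value and helper names are assumptions): run VM.start in a child process and, if it hangs past the timeout, terminate it so the test can recover by bringing the failed device back online or rebooting the host.

```python
# Sketch of guarding a potentially hung VM.start with multiprocessing.
# Timeout and vm/host helpers are illustrative assumptions; the objects
# passed to the child process must be picklable.
import multiprocessing


def _start_vm(vm, host_uuid):
    vm.start(on=host_uuid)


def start_vm_with_timeout(vm, host, timeout=600):
    proc = multiprocessing.Process(target=_start_vm, args=(vm, host.uuid))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        # VM.start is stuck (e.g. xcp-persistent-database was InUse on the
        # host with the failing disk): kill the worker and let the caller
        # recover by onlining the failed device or rebooting the host.
        proc.terminate()
        proc.join()
        raise TimeoutError(f"VM.start did not complete within {timeout}s")
```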

@rushikeshjadhav rushikeshjadhav marked this pull request as draft May 24, 2025 07:51
…e`, are on failing disk-host, then VM.start may get stuck.

The state can be recovered by bringing the failed device back online; however, it means that the test failed.

Signed-off-by: Rushikesh Jadhav <[email protected]>
@rushikeshjadhav rushikeshjadhav force-pushed the feat-storage-linstor-627 branch from 33280e1 to 52d2a71 on May 26, 2025 17:15