AArch64 unit test got blocked frequently in CI #3261
Comments
Would it work if we add |
I think next time it happens we should attach strace / gdb to the stalled process to find out what it's doing. |
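Before attaching strace or gdb, a quick first look at a stalled process is its scheduler state in `/proc/<pid>/stat` (Linux only). The snippet below is only a sketch of that idea, not something used in this CI; the helper names `parse_state` and `describe` are made up for illustration. A state of `D` (uninterruptible sleep) would hint at a kernel-side wait rather than a userspace deadlock.

```python
from pathlib import Path

# Common one-letter process states reported in /proc/<pid>/stat.
STATE_NAMES = {
    "R": "running",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "Z": "zombie",
    "T": "stopped",
}


def parse_state(stat_text):
    """Return the one-letter state from the contents of /proc/<pid>/stat.

    The second field (the command name) is wrapped in parentheses and may
    itself contain spaces or ')', so the state is taken as the first token
    after the LAST ')'.
    """
    after_comm = stat_text[stat_text.rindex(")") + 1:]
    return after_comm.split()[0]


def describe(pid):
    """Hypothetical helper: human-readable state of a live process."""
    state = parse_state(Path(f"/proc/{pid}/stat").read_text())
    return STATE_NAMES.get(state, state)
```

For example, `describe(12345)` could be called with the PID of the stuck `cargo` process (12345 is a placeholder) before deciding whether gdb or strace is the better next step.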
We occasionally have the aarch64 CI machine stalled when running unit tests with `cargo test`, which in turn will prevent CI jobs of all pending PRs from finishing. This patch provides a workaround with a given timeout. The timeout now is set as 600s given our CI jobs normally can finish the unit tests within 150s.

Fixes: cloud-hypervisor#3261

Signed-off-by: Bo Chen <[email protected]>
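The timeout idea in the commit message above can be sketched roughly as follows. This is an illustration only, not the actual patch (which is not shown in this thread): the function name `run_with_timeout` is hypothetical, and returning 124 on timeout is a choice made here to mirror the exit status of the GNU `timeout` utility. The command is started in its own session so that, on timeout, the whole process group (for example, test binaries spawned by cargo) is killed rather than only the top-level process.

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or 124 if it exceeds timeout_s seconds."""
    # start_new_session=True makes the child a session and process-group
    # leader, so its process-group ID equals its PID.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Kill the child together with anything it spawned in its group.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the child so no zombie is left behind
        return 124
```

With the numbers from the commit message, a CI step might call `run_with_timeout(["cargo", "test"], 600)` and fail the job on a non-zero result, leaving the machine free for the next PR.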
That's right. Let's see whether we can understand the root-cause. I will hold the 'timeout' workaround for that. |
More observation:
I attached to the Some pending See the gdb printings: gdb.txt |
Created an issue in cargo community for help: rust-lang/cargo#10007 |
@MrXinWang @michael2012z The aarch64 CI is blocked again #3242, can you please kill the cargo process manually? Thank you. |
Done that. By the way, I received some feedback from cargo community about the solution/workaround to resolve the pending issue. Will try that later today. |
Any update on this? |
I tried the workaround suggested in rust-lang/cargo#10007 (comment), but it didn't work: the problem still occurred when multiple unit tests were running in parallel. The CI workload has been light recently, so it hasn't caused much trouble. A potential fix was merged into the kernel in a 5.15 RC, but I am not sure whether it works. I also think it's too early to update the kernel just for that fix; we'd better update to a stable version in the future. |
I just installed kernel 5.16 on the arm64 CI server and rebooted. Over the following days I will monitor whether the new kernel fixes the stalled-job issue. |
The problem was not seen again after upgrading the kernel. I think it's time to close this issue. |
In recent weeks (not sure when it began), the AArch64 unit test job has frequently hung. When this happens, CI for all subsequent PRs is blocked, and we have to log in to the CI server and kill the stuck containers manually.
An example: https://cloud-hypervisor-jenkins.westus.cloudapp.azure.com/blue/organizations/jenkins/cloud-hypervisor/detail/PR-3236/4/pipeline/248
Any idea why this happens? Could it be a cargo issue?