
AArch64 unit test got blocked frequently in CI #3261

Closed
michael2012z opened this issue Oct 21, 2021 · 11 comments

Comments

@michael2012z
Member

In recent weeks (not sure exactly when it began), the AArch64 unit test job has frequently hung. When this happens, the CI jobs of all subsequent PRs are blocked, and we have to log in to the CI server and kill the pending containers manually.

An example: https://cloud-hypervisor-jenkins.westus.cloudapp.azure.com/blue/organizations/jenkins/cloud-hypervisor/detail/PR-3236/4/pipeline/248

Any idea why this happens? Could it be a cargo issue?

@likebreath
Member

Would it work if we added a timeout to the `cargo test` command in run_unit_tests.sh? It should keep the CI machine from stalling beyond the given timeout. Not sure about the root cause, though.
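
For illustration, a minimal sketch of what such a wrapper in run_unit_tests.sh could look like, assuming GNU coreutils `timeout` is available on the CI machine; the 600s value matches the figure used in the patch below, and the exact `cargo test` flags are placeholders, not the real script:

```sh
#!/usr/bin/env bash
set -e

# Kill `cargo test` if it runs longer than 600 seconds. `timeout` exits with a
# non-zero status when the command is killed, so the CI step fails instead of
# hanging forever.
timeout --signal=KILL 600 cargo test --workspace
```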

@rbradford
Member

I think next time it happens we should attach strace / gdb to the stalled process to find out what it's doing.
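
For reference, a sketch of how that could be done on the CI machine once the hang reproduces; the PID placeholders and process names are assumptions, not captured from an actual stall:

```sh
# Find the stuck cargo/rustc processes.
pgrep -a cargo
pgrep -a rustc

# Show which syscall the stalled process is blocked in (follow threads).
sudo strace -f -p <PID>

# Or attach gdb non-interactively and dump backtraces for all threads.
sudo gdb -p <PID> -batch -ex 'thread apply all bt'
```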

likebreath added a commit to likebreath/cloud-hypervisor that referenced this issue Oct 21, 2021
We occasionally have the aarch64 CI machine stalled when running unit
tests with `cargo test`, which in turn will prevent CI jobs of all
pending PRs from finishing. This patch provides a workaround with a
given timeout. The timeout now is set as 600s given our CI jobs normally
can finish the unit tests within 150s.

Fixes: cloud-hypervisor#3261

Signed-off-by: Bo Chen <[email protected]>
@likebreath
Member

> I think next time it happens we should attach strace / gdb to the stalled process to find out what it's doing.

That's right. Let's see whether we can understand the root cause first. I will hold off on the 'timeout' workaround until then.

@michael2012z
Member Author

michael2012z commented Oct 22, 2021

More observations:

  • The issue is really random. I ran the unit test in a 100-iteration loop, and the problem was reproduced on the 79th run (a rough sketch of the loop is at the end of this comment).
  • It was only seen on the CI server. I have run the unit test hundreds of times on my local server and never saw it.

I attached to the cargo process while it was pending; it appeared to be blocked in a write syscall: https://code.woboq.org/userspace/glibc/sysdeps/unix/sysv/linux/write.c.html#26

Some pending rustc processes were also seen.

See the gdb output: gdb.txt
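
A rough sketch of the kind of reproduction loop described above; the script, the timeout value, and the `cargo test` flags are assumptions, not the commands actually used:

```sh
#!/usr/bin/env bash
# Run the unit tests repeatedly and stop at the first iteration that hangs.
# 600 seconds is an arbitrary cut-off, well above the normal ~150s run time.
for i in $(seq 1 100); do
    echo "=== iteration $i ==="
    if ! timeout 600 cargo test --workspace; then
        echo "unit tests hung or failed on iteration $i"
        break
    fi
done
```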

@michael2012z
Member Author

Created an issue with the cargo community asking for help: rust-lang/cargo#10007

@likebreath
Member

@MrXinWang @michael2012z The aarch64 CI is blocked again on #3242. Can you please kill the cargo process manually? Thank you.
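
For reference, a sketch of what the manual cleanup on the CI server might look like; the process patterns and the container commands are assumptions about the setup, not the exact steps used:

```sh
# List the stuck build/test processes, then kill them.
pgrep -a -f 'cargo test'
sudo pkill -KILL -f 'cargo test'
sudo pkill -KILL rustc

# If the stalled job is running inside a container, remove it as well.
docker ps
docker rm -f <container-id>
```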

@michael2012z
Member Author

Done that.

By the way, I received some feedback from the cargo community about a workaround for the hanging issue. I will try it later today.

@rbradford
Member

> Done that.
>
> By the way, I received some feedback from the cargo community about a workaround for the hanging issue. I will try it later today.

Any update on this?

@michael2012z
Member Author

I tried to apply a workaround suggested in rust-lang/cargo#10007 (comment), but it didn't work. The problem was still seen when multiple unit tests were running in parallel.

Recently the CI workload has not been heavy, so this hasn't caused much trouble.

A potential fix was merged into the kernel in a 5.15 RC, but I am not sure whether it works. I also think it's too early to update the kernel just for this fix; we'd better move to a stable kernel version in the future.

@michael2012z
Member Author

michael2012z commented Jan 10, 2022

I have just installed kernel 5.16 on the arm64 CI server and rebooted. I will monitor over the following days whether the new kernel fixes the hanging issue.
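
For completeness, a trivial check to confirm the kernel upgrade took effect after the reboot (not actual output from the CI server):

```sh
# Should report a 5.16.x kernel after the reboot.
uname -r
```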

@michael2012z
Member Author

The problem was not seen again after upgrading the kernel. I think it's time to close this issue.
