Skip to content

[release/9.0] Fix edge cases in Tarjan GC bridge (Android) #114682

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: release/9.0-staging
Choose a base branch
from

Conversation

filipnavara
Copy link
Member

Backport of #112825, #112970, #113044, #113703 and #114637 to release/9.0-staging

/cc @BrzVlad @vitek-karas

Customer Impact

  • Customer reported
  • Found internally

Two customer issues on Android were traced back to bugs in the MonoVM Tarjan GC bridge responsible for bridging .NET and Java garbage collectors.

In the first issue (dotnet/android#9039) the runtime failed to report edges to the Android GC bridge code, resulting in alive objects collected in the Java runtime and their .NET peers throwing ObjectDisposedException. The particular pattern could have been triggered by code using await on the UI thread where the continuation was scheduled to run through synchronization context on the same thread.

In the second issue (#106410) the runtime crashed on an assertion for object graphs of certain shape. The cause of the crash was an incorrect assumption about duplicate edges in a compressed object graph and how later stages of the algorithm process them. When the later stages of the processing deduplicated the edges it may have resulted in an unforseen change to some of the graph properties leading to a logic error in the algorithm.

Regression

  • Yes
  • No

Both of these bugs existed in .NET MonoVM / Android code since the inception. Legacy Xamarin.Android offered 3 different bridge algorithms (old, new, and tarjan) with the new bridge being the default for a while. The default was changed to tarjan before MonoVM was merged into .NET.

Testing

Test cases reproducing both issues were presented and verified against the fix. Additionally, a new testing code was introduced to specifically test the problematic patterns as part of .NET runtime testing infrastructure.

Risk

Medium - the changes simply make reporting the graph more accurate. Even though we have high confidence in the changes, we recognize it does touch a sensitive area.

filipnavara and others added 6 commits April 15, 2025 13:54
…otnet#112970)

Fix typo in GC bridge comparison message (SCCS -> XREFS)
* [mono][sgen] Fix DUMP_GRAPH debug option build for tarjan bridge

* [mono][sgen] Don't create ScanData* during debug dumping of SCCs

It serves no purpose and it would later crash the runtime since we didn't patch the lockword back in place.

* [mono][sgen] Fix some null deref crashes in DUMP_GRAPH debug option

* [mono][tests] Add bridge tests

These are ported from some of the bridge tests we had on mono/mono. In order to test them we compare between the output of the new and the tarjan bridge.
…formation (dotnet#112825)

* Fix an edge case in the Tarjan SCC that lead to losing xref information

In the Tarjan SCC bridge processing there's a color graph used to find out
connections between SCCs. There was a rare case which only manifested when
a cycle in the object graph points to another cycle that points to a bridge
object. We only recognized direct bridge pointers but not pointers to other
non-bridge SCCs that in turn point to bridges and where we already calculated
the xrefs. These xrefs were then lost.

* Add test case to sgen-bridge-pathologies and add an assert to catch the original bug

* Add review

---------

Co-authored-by: Vlad Brezae <brezaevlad@gmail.com>
…duplication (dotnet#113044)

* [SGen/Tarjan] Handle edge case with node heaviness changing due to deduplication

Do early deduplication

Fix Windows build

Add test cases to sgen-bridge-pathologies

* Move test code

* Remove old code

* Add extra check (no change to functionality)
* [mono][sgen] Fix initialization of can_reduce_color

Mark it as true also in the case when the SCC contains bridge objects. Code populating other_colors for this SCC depends on it.

* Disable test on wasm and mobile

The test is meant to be run only on desktop.
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Apr 15, 2025
@filipnavara filipnavara added Servicing-consider Issue for next servicing release review os-android area-GC-mono community-contribution Indicates that the PR has been added by a community member and removed area-GC-coreclr community-contribution Indicates that the PR has been added by a community member labels Apr 15, 2025
@filipnavara filipnavara requested review from steveisok and removed request for lambdageek and BrzVlad April 15, 2025 11:59
Copy link
Contributor

Tagging subscribers to this area: @BrzVlad
See info in area-owners.md if you want to be subscribed.

@vitek-karas
Copy link
Member

@BrzVlad could you please check that this PR contains all the fixes we need for this? We had to revert it last time around, so I want to make sure we have all the changes we know of right now. The plan is to take this for the next servicing release.

@BrzVlad
Copy link
Member

BrzVlad commented Apr 28, 2025

Note there is also a recent fix for thenew bridge in #115049. We could also include it but, given that issue wasn't customer reported and the new bridge is not the default, I don't feel particularly strong about it.

@filipnavara
Copy link
Member Author

filipnavara commented Apr 28, 2025

We could also include it but, given that issue wasn't customer reported and the new bridge is not the default, I don't feel particularly strong about it.

I was undecided about including #115049 and I am happy to go either way, or postpone it into some future servicing update.

It's fixing the "new" bridge which is not the default and improves the testing infrastructure. While it is not the default, it was a suggested workaround for some of the issues so there may be people running it. Notably, Uno.NET switched all the app templates to use "new" bridge some time ago.

@vitek-karas
Copy link
Member

Notably, Uno.NET switched all the app templates to use "new" bridge some time ago.

This might be interesting to revisit with them. My understanding is that in .NET 10 (and with this change maybe even in .NET 9) we've fixed everything we know about for the tarjan bridge. So hopefully Uno.NET would work as well.

Regardless, I'm in favor of treating the fix to the new bridge as a separate port. I'm not against fixing it, but I'd rather keep the servicing fixes focused (this one is already quite complex as is).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-GC-mono community-contribution Indicates that the PR has been added by a community member os-android Servicing-consider Issue for next servicing release review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants