[WIP] Avoid R API calls within OpenMP region #11448


Draft: hcho3 wants to merge 6 commits into base release_1.7.0.

Conversation

@hcho3 (Collaborator) commented May 10, 2025

@hcho3 mentioned this pull request May 10, 2025
@david-cortes (Contributor):

Would it perhaps be better to backport the PR that fixed the same issue in the master branch (#9823)?

@hcho3 (Collaborator, Author) commented May 10, 2025

Is that PR from after the massive rewrite of the R package? If so, it probably won't apply to the 1.7 branch.

@david-cortes (Contributor):

> Is that PR from after the massive rewrite of the R package? If so, it probably won't apply to the 1.7 branch.

I think it was before the changes to .R files, but I don't remember. The thing is that it also fixed other issues along the way.

@david-cortes (Contributor):

Actually, looking at the PR: it might not be easy to backport, so let's stick with this one.

@@ -6,10 +6,10 @@
   <parent>
     <groupId>ml.dmlc</groupId>
     <artifactId>xgboost-jvm_2.12</artifactId>
-    <version>1.7.11</version>
+    <version>1.7.12</version>
Contributor:

Since this is R-only, I guess changes here are not needed.

expect_equal(importance_from_dump(), importance, tolerance = 1e-6)

## decision stump
m <- xgboost::xgboost(
Contributor:

I'm not sure if it'd be correct to use xgboost:: in tests. It's certainly not needed, as the whole namespace is imported.


n_threads <- 2

test_that("training with feature weights works", {
Contributor:

I'd say it's better not to add extra tests for this short-lived release, since they increase the chances of rejection and of extra fixes being needed.

@hcho3 (Collaborator, Author), May 10, 2025:

This is merely restoring the tests removed in the previous commit, per the CRAN maintainer's request.

@@ -247,15 +247,18 @@ XGB_DLL SEXP XGDMatrixSetInfo_R(SEXP handle, SEXP field, SEXP array) {
auto ctx = DMatrixCtx(R_ExternalPtrAddr(handle));
if (!strcmp("group", name)) {
std::vector<unsigned> vec(len);
const int* array_ptr = INTEGER_RO(array);
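
For context, the pattern being applied here: R API accessors may only be called from R's main thread, so the fix extracts the raw pointer before any parallel work and keeps the OpenMP loop on plain memory. A minimal sketch of that pattern, with a hypothetical helper name (CopyGroupVector) rather than the actual XGBoost code:

#include <vector>

#include <Rinternals.h>

// Sketch only (compile with -fopenmp): do all R API calls up front on the
// main thread, then run the OpenMP loop over plain C memory.
std::vector<unsigned> CopyGroupVector(SEXP array) {
  const R_xlen_t len = Rf_xlength(array);    // R API call: main thread only
  const int *array_ptr = INTEGER_RO(array);  // R API call: main thread only
  std::vector<unsigned> vec(len);
  #pragma omp parallel for
  for (R_xlen_t i = 0; i < len; ++i) {
    vec[i] = static_cast<unsigned>(array_ptr[i]);  // no R API inside the region
  }
  return vec;
}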
Review comment:

I hadn't noticed that you have Depends: R (>= 3.3.0). INTEGER_RO and REAL_RO appeared in R 3.5.0. You can either redefine them on older versions of R or use INTEGER and REAL.
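
A sketch of the version guard this comment suggests, using the standard Rversion.h macros; this illustrates the reviewer's first option and is not necessarily what the PR ended up doing:

#include <Rversion.h>
#include <Rinternals.h>

#if R_VERSION < R_Version(3, 5, 0)
// R before 3.5.0 has no read-only accessors; fall back to the writable
// INTEGER/REAL and add const so existing call sites compile unchanged.
#define INTEGER_RO(x) ((const int *) INTEGER(x))
#define REAL_RO(x) ((const double *) REAL(x))
#endif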

@trivialfis (Member):

Thank you for working on this! Please let me know if there's anything I can help with (testing, submitting packages, etc.).

@hcho3 (Collaborator, Author) commented May 10, 2025

@trivialfis Can you help with submitting the package with the fix?

@trivialfis (Member) commented May 11, 2025

> Can you help with submitting the package with the fix?

Is this ready? (It's still marked as a draft.)

Since 1.7.11 never went through, do you think we need to bump to 1.7.12?

Lastly, did you manage to tackle the issues with the changed OpenBLAS? There seem to be test failures related to floating-point accuracy.

@hcho3 (Collaborator, Author) commented May 11, 2025

> Lastly, did you manage to tackle the issues with the changed OpenBLAS? There seem to be test failures related to floating-point accuracy.

No, I was never able to reproduce the issue.

@david-cortes (Contributor):

Per the CRAN recommendations (https://cran.r-project.org/web/packages/policies.html):

> Increasing the version number at each submission reduces confusion so is preferred even when a previous submission was not accepted.

> There seem to be test failures related to floating-point accuracy.

I don't think that's the case. The failures looked like completely different results from what would be expected, not small numerical inaccuracies.

> No, I was never able to reproduce the issue.

Perhaps one way to reproduce it would be to build R with the --enable-strict-barrier flag, as in this email: https://stat.ethz.ch/pipermail/r-package-devel/2025q2/011652.html

@hcho3 (Collaborator, Author) commented May 11, 2025

> Perhaps one way to reproduce it would be to build R with the --enable-strict-barrier flag, as in this email:

I already tried it and could not reproduce the other test failures.

@trivialfis Has the maintainer gotten back to you about reproducing the OpenBLAS environment? If not, let's ask for permission to remove the failing tests.

@trivialfis (Member):

> Has the maintainer gotten back to you about reproducing the OpenBLAS environment? If not, let's ask for permission to remove the failing tests.

We need to separate the issues here. I'm not sure whether the issue fixed by this PR is the cause of all the test failures.

How many of the test failures did you reproduce using the --enable-strict-barrier flag: all of them, some of them, or none of them? Could you please share the check.log from the --enable-strict-barrier run?

Are you using Docker to build the R variant? If so, I can run the tests as well.

@hcho3 (Collaborator, Author) commented May 12, 2025

> How many of the test failures did you reproduce using the --enable-strict-barrier flag: all of them, some of them, or none of them?

Only the tests that were failing due to the memory protection error, e.g. test_feature_weights.R. These tests crash immediately when run with --enable-strict-barrier.

I will get back to you with the full list of tests that are affected by the memory issue.

@david-cortes (Contributor):

> > Has the maintainer gotten back to you about reproducing the OpenBLAS environment? If not, let's ask for permission to remove the failing tests.
>
> We need to separate the issues here. I'm not sure whether the issue fixed by this PR is the cause of all the test failures.
>
> How many of the test failures did you reproduce using the --enable-strict-barrier flag: all of them, some of them, or none of them? Could you please share the check.log from the --enable-strict-barrier run?
>
> Are you using Docker to build the R variant? If so, I can run the tests as well.

In the case of memory corruption, since all the tests are executed in the same process, the problem does not necessarily come from the specific test that fails, nor are the failures necessarily deterministically reproducible.

> Has the maintainer gotten back to you about reproducing the OpenBLAS environment?

They seldom answer emails, so I wouldn't count on getting a timely reply.

@hcho3 (Collaborator, Author) commented May 12, 2025

@trivialfis Only the test in test_feature_weights.R is affected by the memory protection bug. The other tests run fine with the --enable-strict-barrier flag enabled, but test_feature_weights.R crashes with a segfault:

...
Fatal error: Wrong thread calling 'RunFinalizers'
Fatal error: Wrong thread calling 'RunFinalizers'
Fatal error: Wrong thread calling 'RunFinalizers'
Segmentation fault (core dumped)

And of course, I am not able to reproduce the other failing tests, e.g.:

Failure ('test_interaction_constraints.R:36:3'): interaction constraints for regression
  `actual`:   FALSE
  `expected`: TRUE 
  Monotone constraint satisfied

I'd suggest the following course of action:

  1. Make a new submission with the memory fix (this PR) and see if it also fixes the other tests.
  2. If the other tests still fail on the CRAN server, we should remove the failing tests and re-submit, since we have no way to reproduce them on our machines.
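
The "Wrong thread calling 'RunFinalizers'" message above is the signature of the bug class this PR removes: an R API call made from an OpenMP worker thread can trigger garbage collection, and hence finalizers, off the main thread, which R forbids. A hypothetical illustration of the failure mode (not code from this repository):

#include <Rinternals.h>

// Hypothetical example of the bug class; NOT code from this repository.
// Rf_allocVector may trigger R's garbage collector, which runs finalizers;
// on an OpenMP worker thread this can abort R with errors like
// "Fatal error: Wrong thread calling 'RunFinalizers'".
void BuggyCopy(SEXP input, int n, double *out) {
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    SEXP tmp = PROTECT(Rf_allocVector(REALSXP, 1));  // BUG: R API off the main thread
    REAL(tmp)[0] = REAL(input)[i];
    out[i] = REAL(tmp)[0];
    UNPROTECT(1);
  }
}

The fix, as in the diff earlier in this thread, is to hoist every R API call out of the parallel region.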

@trivialfis (Member) commented May 12, 2025

I've asked for help again on the public mailing list.

@trivialfis (Member):

It looks like the CRAN maintainers accepted my previous package submission, which removed the tests.
