Optimization of data initialization for large sparse datasets #11390

Open
razdoburdin wants to merge 24 commits into master

Conversation

@razdoburdin (Contributor) commented Apr 7, 2025

This PR speeds up data initialization for large sparse datasets on multi-core CPUs by parallelizing the execution.
For the Bosch dataset, this PR improves fitting time by 1.3x on a 2x56-core system.

To avoid race conditions, I have also switched the missing-value flag from a bitfield to uint8_t.
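
A minimal sketch (my illustration, not code from this PR) of why a byte-per-element flag sidesteps the race: setting a single bit is a read-modify-write of a shared byte, so two threads setting different bits of the same byte can lose an update, while writes to distinct uint8_t elements never touch the same memory location.

```cpp
#include <cstdint>
#include <vector>

// Sketch only; function names are hypothetical.

// Bitfield-style flag: a read-modify-write of a shared byte. Two threads
// setting different bits of the same byte concurrently is a data race,
// and one of the |= updates can be lost.
void SetMissingBit(std::vector<std::uint8_t>& bits, std::size_t i) {
  bits[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));  // not thread-safe
}

// Byte-style flag: each element owns a whole byte, so concurrent writes to
// distinct elements are safe without atomics.
void SetMissingByte(std::vector<std::uint8_t>& flags, std::size_t i) {
  flags[i] = 1;
}
```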

@razdoburdin razdoburdin marked this pull request as draft April 7, 2025 13:00
@razdoburdin razdoburdin marked this pull request as ready for review April 11, 2025 09:29
@trivialfis (Member)

Note to myself:

I have also switched the missing-value flag from a bitfield to uint8_t.

That increases memory usage.

@razdoburdin (Contributor, Author)

Hi @trivialfis, what is your opinion about this optimization?

@trivialfis (Member)

Apologies, coming back from a trip. Will look into the optimization.

@trivialfis (Member) left a comment

Could you please provide some data on the effect of memory usage where there are semi-dense columns?

@@ -233,7 +233,7 @@ class GHistIndexMatrix {
void PushAdapterBatchColumns(Context const* ctx, Batch const& batch, float missing,
size_t rbegin);

void ResizeIndex(const size_t n_index, const bool isDense);
void ResizeIndex(const size_t n_index, const bool isDense, int n_threads = 1);
Member:

Could you please share in which cases nthread=1 is used, and what the other cases are?

Contributor Author:

I fixed the code; there is no default value now.

auto ref = RefResourceView{resource->DataAs<T>(), n_elements, resource};

size_t block_size = n_elements / n_threads + (n_elements % n_threads > 0);
#pragma omp parallel num_threads(n_threads)
Member:

Is this faster than std::fill_n for primitive data? Seems unlikely...

Contributor Author:

It is, if the number of elements is high. There is a significant speed-up when the number of elements is around 1e8-1e9.
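
For illustration, a sketch of the pattern being discussed (my code, not the PR's): each thread fills its own contiguous block, which can beat a single-threaded std::fill_n once the buffer is large enough to be memory-bandwidth bound; for small buffers the OpenMP overhead dominates, consistent with the ~1e8-1e9 threshold mentioned above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>
#include <omp.h>

// Sketch: fill `data` with `value` using n_threads, one contiguous block per
// thread. block_size is the ceiling of n / n_threads, mirroring the
// `block_size` computation quoted in the diff above.
template <typename T>
void ParallelFill(std::vector<T>* data, T value, int n_threads) {
  std::size_t const n = data->size();
  std::size_t const block_size = n / n_threads + (n % n_threads > 0);
#pragma omp parallel num_threads(n_threads)
  {
    auto const tid = static_cast<std::size_t>(omp_get_thread_num());
    std::size_t const begin = tid * block_size;
    std::size_t const end = std::min(begin + block_size, n);
    if (begin < end) {
      std::fill_n(data->begin() + begin, end - begin, value);
    }
  }
}
```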

Comment on lines 212 to 213
ColumnBinT* begin = &local_index[feature_offsets_[fid]];
begin[rid] = bin_id - index_base_[fid];
Member:

These two lines look exactly the same as the following two lines.

Contributor Author:

I moved the first line outside the branches. The second one differs.

public:
// get number of features
[[nodiscard]] bst_feature_t GetNumFeature() const {
return static_cast<bst_feature_t>(type_.size());
}

ColumnMatrix() = default;
ColumnMatrix(GHistIndexMatrix const& gmat, double sparse_threshold) {
this->InitStorage(gmat, sparse_threshold);
ColumnMatrix(GHistIndexMatrix const& gmat, double sparse_threshold, int n_threads = 1) {
Member:

In which cases is n_threads 1, and what are the other cases?

Contributor Author:

Fixed.

SetBinSparse(bin_id, rid + base_rowid, fid, local_index);
++k;

dmlc::OMPException exc;
Member:

Could you please add some comments to provide a high-level summary of the steps achieved in this section of the code?

Contributor Author:

I added some comments to the code to make it clearer.
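
As background for the dmlc::OMPException object in the quoted snippet: C++ exceptions must not escape an OpenMP parallel region, so the usual approach is to capture the first exception inside the region and rethrow it on the calling thread afterwards. A self-contained sketch of that pattern (my code, not the dmlc implementation) could look like this:

```cpp
#include <exception>
#include <mutex>

// Sketch: capture the first exception thrown inside an OpenMP region and
// rethrow it on the calling thread once the region has finished.
class OmpExceptionGuard {
 public:
  template <typename Fn>
  void Run(Fn&& fn) {
    try {
      fn();
    } catch (...) {
      std::lock_guard<std::mutex> lock{mutex_};
      if (!eptr_) {
        eptr_ = std::current_exception();
      }
    }
  }
  void Rethrow() {
    if (eptr_) {
      std::rethrow_exception(eptr_);
    }
  }

 private:
  std::mutex mutex_;
  std::exception_ptr eptr_;
};

void ParallelWork(int n_threads) {
  OmpExceptionGuard exc;
#pragma omp parallel num_threads(n_threads)
  {
    exc.Run([&] {
      // ... per-thread work that may throw ...
    });
  }
  exc.Rethrow();  // surfaces the first captured error, if any
}
```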

@razdoburdin (Contributor, Author)

Could you please provide some data on the effect of memory usage where there are semi-dense columns?

I measured peak memory consumption for the Bosch dataset with 224 threads. Master branch: 10.06 GB; this PR: 10.31 GB.

@trivialfis (Member) commented May 12, 2025

I got the following results from synthesized dense data; memory usage was measured with cgmemtime.

* master

[7]     Train-rmse:33.39066
Qdm train (sec) ended in: 25.732778310775757 seconds.
Trained for 8 iterations.
{'BenchIter': {'GetTrain (sec)': 27.98164939880371}, 'Qdm': {'Train-DMatrix-Iter (sec)': 91.10861659049988, 'train (sec)': 25.732778310775757}}

user: 1809.226 s
sys:   28.329 s
wall: 119.860 s
child_RSS_high:   37892596 KiB
group_mem_high:   37677792 KiB

* opt pr

[7]     Train-rmse:33.39066
Qdm train (sec) ended in: 25.054997444152832 seconds.
Trained for 8 iterations.
{'BenchIter': {'GetTrain (sec)': 28.075414180755615}, 'Qdm': {'Train-DMatrix-Iter (sec)': 93.2668731212616, 'train (sec)': 25.054997444152832}}

user: 1807.715 s
sys:   31.093 s
wall: 121.895 s
child_RSS_high:   45232596 KiB
group_mem_high:   45032396 KiB

That's roughly a 20 percent increase ((45032396 - 37677792) / 37677792 ≈ 0.195) in memory usage for dense data. Are you sure you want this PR to go in? I'm asking since memory usage has been a pain point for XGBoost for a very long time; we receive issues mostly about memory usage rather than computation time, so we care about it a lot.

  • n_samples: 16777216 (2 ** 24)
  • n_features: 512
  • density: 1.0
  • dtype: f32

I used my custom benchmark scripts from https://github.com/trivialfis/dxgb_bench.git (not very polished). I loaded the data using an iterator with arrays stored in .npy files; in addition, QuantileDMatrix is used. Feel free to use your own benchmark scripts.

I can test other sparsity levels if needed.
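
A back-of-the-envelope check (my addition, assuming one missing flag per stored element, which holds for fully dense data): 2^24 rows × 512 features = 2^33 flags, which is about 1 GiB when stored as a bitfield versus 8 GiB as uint8_t. The predicted difference of 7 GiB (7,340,032 KiB) is almost exactly the measured child_RSS_high gap of 45232596 - 37892596 = 7,340,000 KiB.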

@razdoburdin (Contributor, Author) commented May 19, 2025

I was able to return to the bitfield representation for the missing-value indicator without losing thread-safe access. It requires quite careful data management, but it combines the benefits of parallelization and low memory consumption. Some additional memory has to be allocated for data alignment, but it is less than 4 bytes per feature in the worst case.
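
My reading of the alignment trick described here (an interpretation, not the PR's actual layout): if each feature's run of missing bits is padded out to a 32-bit word boundary, threads that own different features never read-modify-write the same word, so plain |= stays safe without atomics, and the padding costs at most 31 bits, i.e. under 4 bytes, per feature.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical word-aligned per-feature missing-bit layout. Each feature gets
// ceil(n_rows / 32) whole 32-bit words, so two features never share a word and
// per-feature parallel writes need no atomics.
class AlignedMissingBits {
 public:
  AlignedMissingBits(std::size_t n_features, std::size_t n_rows)
      : words_per_feature_{(n_rows + 31) / 32},
        bits_(n_features * words_per_feature_, 0u) {}

  // Safe to call concurrently from different threads as long as each thread
  // works on a distinct feature id.
  void SetMissing(std::size_t fid, std::size_t row) {
    std::uint32_t* word = &bits_[fid * words_per_feature_ + row / 32];
    *word |= 1u << (row % 32);
  }

  bool IsMissing(std::size_t fid, std::size_t row) const {
    return (bits_[fid * words_per_feature_ + row / 32] >> (row % 32)) & 1u;
  }

 private:
  std::size_t words_per_feature_;
  std::vector<std::uint32_t> bits_;
};
```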

@trivialfis (Member) left a comment

Apologies for the slow response; I will do some tests myself. Please see the inline comments.

@@ -195,34 +236,42 @@ class ColumnMatrix {
}
};

void InitStorage(GHistIndexMatrix const& gmat, double sparse_threshold);
void InitStorage(GHistIndexMatrix const& gmat, double sparse_threshold, int n_threads);

template <typename ColumnBinT, typename BinT, typename RIdx>
void SetBinSparse(BinT bin_id, RIdx rid, bst_feature_t fid, ColumnBinT* local_index) {
Member:

Is this function still used now that we have a new SetBinSparse?

Contributor Author:

The original SetBinSparse is also used.

razdoburdin and others added 2 commits June 2, 2025 08:37