Optimization of data initialization for large sparse datasets #11390

Open
razdoburdin wants to merge 24 commits into master

Conversation

@razdoburdin (Contributor) commented Apr 7, 2025

This PR speeds up data initialization for large sparse datasets on multi-core CPUs by parallelizing the execution.
For the Bosch dataset, this PR improves fitting time by 1.3x on a 2x56-core system.

To avoid race conditions, I have also switched the missing-value flag from a bitfield to uint8_t.
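
A minimal sketch (my illustration, not code from this PR) of why a byte-per-element flag sidesteps the race: setting a single bit is a read-modify-write of a shared byte, so two threads setting different bits of the same byte can lose an update, while writes to distinct uint8_t elements never touch the same memory location.

```cpp
#include <cstdint>
#include <vector>

// Sketch only; function names are hypothetical.

// Bitfield-style flag: a read-modify-write of a shared byte. Two threads
// setting different bits of the same byte concurrently is a data race,
// and one of the |= updates can be lost.
void SetMissingBit(std::vector<std::uint8_t>& bits, std::size_t i) {
  bits[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));  // not thread-safe
}

// Byte-style flag: each element owns a whole byte, so concurrent writes to
// distinct elements are safe without atomics.
void SetMissingByte(std::vector<std::uint8_t>& flags, std::size_t i) {
  flags[i] = 1;
}
```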

@razdoburdin razdoburdin marked this pull request as draft April 7, 2025 13:00
@razdoburdin razdoburdin marked this pull request as ready for review April 11, 2025 09:29
@trivialfis (Member)

Note to myself:

I have also switched the missing-value flag from a bitfield to uint8_t.

That increases memory usage.

@razdoburdin (Contributor, Author)

Hi @trivialfis, what is your opinion about this optimization?

@trivialfis (Member)

Apologies, coming back from a trip. Will look into the optimization.

@trivialfis (Member) left a comment

Could you please provide some data on the effect of memory usage where there are semi-dense columns?

@@ -233,7 +233,7 @@ class GHistIndexMatrix {
void PushAdapterBatchColumns(Context const* ctx, Batch const& batch, float missing,
size_t rbegin);

void ResizeIndex(const size_t n_index, const bool isDense);
void ResizeIndex(const size_t n_index, const bool isDense, int n_threads = 1);
Member:

Could you please share in which cases nthread=1 is used, and what the other cases are?

Contributor Author:

I fixed the code; there is no default value now.

auto ref = RefResourceView{resource->DataAs<T>(), n_elements, resource};

size_t block_size = n_elements / n_threads + (n_elements % n_threads > 0);
#pragma omp parallel num_threads(n_threads)
Member:

Is this faster than std::fill_n for primitive data? Seems unlikely...

Contributor Author:

It is, if the number of elements is high. There is a significant speed-up when the number of elements is around 1e8-1e9.
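
For illustration, a sketch of the pattern being discussed (my code, not the PR's): each thread fills its own contiguous block, which can beat a single-threaded std::fill_n once the buffer is large enough to be memory-bandwidth bound; for small buffers the OpenMP overhead dominates, consistent with the ~1e8-1e9 threshold mentioned above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>
#include <omp.h>

// Sketch: fill `data` with `value` using n_threads, one contiguous block per
// thread. block_size is the ceiling of n / n_threads, mirroring the
// `block_size` computation quoted in the diff above.
template <typename T>
void ParallelFill(std::vector<T>* data, T value, int n_threads) {
  std::size_t const n = data->size();
  std::size_t const block_size = n / n_threads + (n % n_threads > 0);
#pragma omp parallel num_threads(n_threads)
  {
    auto const tid = static_cast<std::size_t>(omp_get_thread_num());
    std::size_t const begin = tid * block_size;
    std::size_t const end = std::min(begin + block_size, n);
    if (begin < end) {
      std::fill_n(data->begin() + begin, end - begin, value);
    }
  }
}
```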

Comment on lines 212 to 213
ColumnBinT* begin = &local_index[feature_offsets_[fid]];
begin[rid] = bin_id - index_base_[fid];
Member:

These two lines look exactly the same as the following two lines.

Contributor Author:

I moved the first line outside the branches. The second one differs.

public:
// get number of features
[[nodiscard]] bst_feature_t GetNumFeature() const {
return static_cast<bst_feature_t>(type_.size());
}

ColumnMatrix() = default;
ColumnMatrix(GHistIndexMatrix const& gmat, double sparse_threshold) {
this->InitStorage(gmat, sparse_threshold);
ColumnMatrix(GHistIndexMatrix const& gmat, double sparse_threshold, int n_threads = 1) {
Member:

In which cases is n_threads 1, and what are the other cases?

Contributor Author:

Fixed.

SetBinSparse(bin_id, rid + base_rowid, fid, local_index);
++k;

dmlc::OMPException exc;
Member:

Could you please add some comments to provide a high-level summary of the steps achieved in this section of the code?

Contributor Author:

I added some comments to the code to make it clearer.
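
As background for the dmlc::OMPException object in the quoted snippet: C++ exceptions must not escape an OpenMP parallel region, so the usual approach is to capture the first exception inside the region and rethrow it on the calling thread afterwards. A self-contained sketch of that pattern (my code, not the dmlc implementation) could look like this:

```cpp
#include <exception>
#include <mutex>

// Sketch: capture the first exception thrown inside an OpenMP region and
// rethrow it on the calling thread once the region has finished.
class OmpExceptionGuard {
 public:
  template <typename Fn>
  void Run(Fn&& fn) {
    try {
      fn();
    } catch (...) {
      std::lock_guard<std::mutex> lock{mutex_};
      if (!eptr_) {
        eptr_ = std::current_exception();
      }
    }
  }
  void Rethrow() {
    if (eptr_) {
      std::rethrow_exception(eptr_);
    }
  }

 private:
  std::mutex mutex_;
  std::exception_ptr eptr_;
};

void ParallelWork(int n_threads) {
  OmpExceptionGuard exc;
#pragma omp parallel num_threads(n_threads)
  {
    exc.Run([&] {
      // ... per-thread work that may throw ...
    });
  }
  exc.Rethrow();  // surfaces the first captured error, if any
}
```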

@razdoburdin (Contributor, Author)

Could you please provide some data on the effect of memory usage where there are semi-dense columns?

I measured peak memory consumption for the Bosch dataset with 224 threads. Master branch: 10.06 GB; this PR: 10.31 GB.

@trivialfis (Member) commented May 12, 2025

I got the following results from synthesized dense data; memory usage was measured with cgmemtime.

* master

[7]     Train-rmse:33.39066
Qdm train (sec) ended in: 25.732778310775757 seconds.
Trained for 8 iterations.
{'BenchIter': {'GetTrain (sec)': 27.98164939880371}, 'Qdm': {'Train-DMatrix-Iter (sec)': 91.10861659049988, 'train (sec)': 25.732778310775757}}

user: 1809.226 s
sys:   28.329 s
wall: 119.860 s
child_RSS_high:   37892596 KiB
group_mem_high:   37677792 KiB

* opt pr

[7]     Train-rmse:33.39066
Qdm train (sec) ended in: 25.054997444152832 seconds.
Trained for 8 iterations.
{'BenchIter': {'GetTrain (sec)': 28.075414180755615}, 'Qdm': {'Train-DMatrix-Iter (sec)': 93.2668731212616, 'train (sec)': 25.054997444152832}}

user: 1807.715 s
sys:   31.093 s
wall: 121.895 s
child_RSS_high:   45232596 KiB
group_mem_high:   45032396 KiB

That's roughly a 20 percent increase ((45032396 - 37677792) / 37677792 ≈ 0.195) in memory usage for dense data. Are you sure you want this PR to go in? I'm asking since memory usage has been a pain point for XGBoost for a very long time; we receive issues mostly about memory usage rather than computation time, so we care about it a lot.

  • n_samples: 16777216 (2 ** 24)
  • n_features: 512
  • density: 1.0
  • dtype: f32

I used my custom benchmark scripts from https://github.com/trivialfis/dxgb_bench.git (not very polished). I loaded the data using an iterator with arrays stored in .npy files; in addition, QuantileDMatrix is used. Feel free to use your own benchmark scripts.

I can test other sparsity levels if needed.
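
A back-of-the-envelope check (my addition, assuming one missing flag per stored element, which holds for fully dense data): 2^24 rows × 512 features = 2^33 flags, which is about 1 GiB when stored as a bitfield versus 8 GiB as uint8_t. The predicted difference of 7 GiB (7,340,032 KiB) is almost exactly the measured child_RSS_high gap of 45232596 - 37892596 = 7,340,000 KiB.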

@razdoburdin (Contributor, Author) commented May 19, 2025

I was able to return to the bitfield representation for the missing-value indicator without losing thread-safe access. It requires quite careful data management, but it combines the benefits of parallelization and low memory consumption. Some additional memory has to be allocated for data alignment, but it is less than 4 bytes per feature in the worst case.
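
My reading of the alignment trick described here (an interpretation, not the PR's actual layout): if each feature's run of missing bits is padded out to a 32-bit word boundary, threads that own different features never read-modify-write the same word, so plain |= stays safe without atomics, and the padding costs at most 31 bits, i.e. under 4 bytes, per feature.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical word-aligned per-feature missing-bit layout. Each feature gets
// ceil(n_rows / 32) whole 32-bit words, so two features never share a word and
// per-feature parallel writes need no atomics.
class AlignedMissingBits {
 public:
  AlignedMissingBits(std::size_t n_features, std::size_t n_rows)
      : words_per_feature_{(n_rows + 31) / 32},
        bits_(n_features * words_per_feature_, 0u) {}

  // Safe to call concurrently from different threads as long as each thread
  // works on a distinct feature id.
  void SetMissing(std::size_t fid, std::size_t row) {
    std::uint32_t* word = &bits_[fid * words_per_feature_ + row / 32];
    *word |= 1u << (row % 32);
  }

  bool IsMissing(std::size_t fid, std::size_t row) const {
    return (bits_[fid * words_per_feature_ + row / 32] >> (row % 32)) & 1u;
  }

 private:
  std::size_t words_per_feature_;
  std::vector<std::uint32_t> bits_;
};
```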

@trivialfis (Member) left a comment

Apologies for the slow response; I will do some tests myself. Please see the inline comments.

@@ -195,34 +236,42 @@ class ColumnMatrix {
}
};

void InitStorage(GHistIndexMatrix const& gmat, double sparse_threshold);
void InitStorage(GHistIndexMatrix const& gmat, double sparse_threshold, int n_threads);

template <typename ColumnBinT, typename BinT, typename RIdx>
void SetBinSparse(BinT bin_id, RIdx rid, bst_feature_t fid, ColumnBinT* local_index) {
Member:

Is this function still used now that we have a new SetBinSparse?

Contributor Author:

The original SetBinSparse is also used.

razdoburdin and others added 2 commits June 2, 2025 08:37