Optimization of data initialization for large sparse datasets #11390
Conversation
Note to myself:
That is increasing memory usage.

Hi @trivialfis,

Apologies, coming back from a trip. Will look into the optimization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please provide some data on the memory-usage impact when there are semi-dense columns?
src/data/gradient_index.h (Outdated)
@@ -233,7 +233,7 @@ class GHistIndexMatrix {
   void PushAdapterBatchColumns(Context const* ctx, Batch const& batch, float missing,
                                size_t rbegin);

-  void ResizeIndex(const size_t n_index, const bool isDense);
+  void ResizeIndex(const size_t n_index, const bool isDense, int n_threads = 1);
Could you please share in which case `n_threads = 1`, and what are the other cases?
I fixed the code, no default value now.
  auto ref = RefResourceView{resource->DataAs<T>(), n_elements, resource};

  size_t block_size = n_elements / n_threads + (n_elements % n_threads > 0);
#pragma omp parallel num_threads(n_threads)
Is this faster than `std::fill_n` for primitive data? Seems unlikely.
It is, if the number of elements is high. There is a significant speed-up for around 1e8 to 1e9 elements.
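The blocked fill discussed here can be sketched as follows. `ParallelFill` is a hypothetical stand-alone version for illustration, not the PR's actual code; each thread fills a disjoint contiguous block, and the code degrades gracefully to a plain serial fill when OpenMP is not enabled (the pragma is simply ignored).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of a blocked parallel fill: split [0, n_elements)
// into n_threads contiguous blocks and let each thread run std::fill_n
// on its own block. The blocks are disjoint, so there is no race.
template <typename T>
void ParallelFill(T* data, std::size_t n_elements, T value, int n_threads) {
  // Round the block size up so the blocks cover all elements.
  std::size_t block_size = n_elements / n_threads + (n_elements % n_threads > 0);
#pragma omp parallel for num_threads(n_threads)
  for (int tid = 0; tid < n_threads; ++tid) {
    std::size_t begin = static_cast<std::size_t>(tid) * block_size;
    std::size_t end = std::min(begin + block_size, n_elements);
    if (begin < end) {
      std::fill_n(data + begin, end - begin, value);
    }
  }
}
```

For small buffers the serial `std::fill_n` wins because of thread start-up overhead; the benefit claimed above only appears once the buffer is large enough (hundreds of millions of elements) that memory bandwidth across cores dominates.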
src/common/column_matrix.h (Outdated)
  ColumnBinT* begin = &local_index[feature_offsets_[fid]];
  begin[rid] = bin_id - index_base_[fid];
These two lines look exactly the same as the following two lines.
I moved the first line outside the branches. The second one differs between branches.
src/common/column_matrix.h (Outdated)
 public:
  // get number of features
  [[nodiscard]] bst_feature_t GetNumFeature() const {
    return static_cast<bst_feature_t>(type_.size());
  }

  ColumnMatrix() = default;
- ColumnMatrix(GHistIndexMatrix const& gmat, double sparse_threshold) {
    this->InitStorage(gmat, sparse_threshold);
+ ColumnMatrix(GHistIndexMatrix const& gmat, double sparse_threshold, int n_threads = 1) {
In which case is `n_threads` 1, and what are the other cases?
fixed
  SetBinSparse(bin_id, rid + base_rowid, fid, local_index);
  ++k;

  dmlc::OMPException exc;
Could you please add some comments to provide a high-level summary of the steps achieved in this section of the code?
I added some comments in the code to make it clearer.
I measured the peak memory consumption for

Got the following results from synthesized dense data, memory usage measured by

That's a 20 percent increase.

I used my custom benchmark scripts here: https://github.com/trivialfis/dxgb_bench.git (not very polished). I loaded the data using an iterator with arrays stored in

I can test other sparsity levels if needed.

I was able to return to the bitfield representation for the missing indicator without losing thread-safe access. It requires quite careful data management, but it combines the benefits of parallelization and low memory consumption. Some additional memory has to be allocated in this case for data alignment, but it is less than 4 bytes per feature in the worst case.
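The alignment trick described in this comment can be illustrated with a minimal, hypothetical sketch (the class and member names below are illustrative, not the PR's code): each feature's run of missing-indicator bits is padded to start on a fresh 32-bit word, so threads that own different features never read-modify-write the same word, and no atomics are needed. The padding costs at most 31 bits, i.e. under 4 bytes per feature, matching the figure quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-feature word-aligned bitfield for missing flags.
// Each feature owns a whole number of 32-bit words, so two threads
// working on different features cannot touch the same word.
class AlignedMissingBits {
  static constexpr std::size_t kWordBits = 32;
  std::vector<std::uint32_t> words_;
  std::size_t words_per_feature_{0};

 public:
  AlignedMissingBits(std::size_t n_features, std::size_t n_rows) {
    // Round each feature's bit range up to a word boundary.
    words_per_feature_ = (n_rows + kWordBits - 1) / kWordBits;
    words_.assign(n_features * words_per_feature_, 0u);
  }
  void SetMissing(std::size_t fid, std::size_t rid) {
    words_[fid * words_per_feature_ + rid / kWordBits] |=
        1u << (rid % kWordBits);
  }
  bool IsMissing(std::size_t fid, std::size_t rid) const {
    return (words_[fid * words_per_feature_ + rid / kWordBits] >>
            (rid % kWordBits)) & 1u;
  }
};
```

The thread-safety guarantee relies on the parallel loop being partitioned by feature; if two threads could set bits for the same feature, the word-level read-modify-write would race again.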
Apologies for the slow response, I will do some tests myself. Please see the inline comments.
@@ -195,34 +236,42 @@ class ColumnMatrix {
  }
 };

- void InitStorage(GHistIndexMatrix const& gmat, double sparse_threshold);
+ void InitStorage(GHistIndexMatrix const& gmat, double sparse_threshold, int n_threads);

  template <typename ColumnBinT, typename BinT, typename RIdx>
  void SetBinSparse(BinT bin_id, RIdx rid, bst_feature_t fid, ColumnBinT* local_index) {
Is this function still used now that we have a new `SetBinSparse`?
The original SetBinSparse is also used.
Co-authored-by: Jiaming Yuan <[email protected]>
This PR speeds up data initialization for large sparse datasets on multi-core CPUs by parallelizing the execution.

For the bosch dataset, this PR improves fitting time by 1.3x on a 2x56-core system. To avoid a race condition, I have also switched the missing-value flag from a bitfield to uint8_t.
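The race the description refers to can be sketched with a minimal, hypothetical example (names below are illustrative). In a packed bitfield, eight rows share one byte, so setting one row's flag is a read-modify-write on memory shared with its neighbours; two threads handling rows in the same byte can lose updates. With one `uint8_t` per row, each row owns a distinct memory location, and concurrent plain stores to distinct locations are race-free in C++, at the cost of 8x the memory of a bitfield.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch contrasting the two missing-flag layouts.
//
// Packed bitfield (original layout): setting a row's flag is
//   bits[rid / 8] |= 1u << (rid % 8);
// which reads and rewrites a byte shared by eight rows -- not safe
// when two threads handle rows that land in the same byte.
//
// One byte per row (this PR's layout): distinct rows write distinct
// bytes, so concurrent writes to different rows cannot race.
struct ByteMissingFlags {
  std::vector<std::uint8_t> flags;  // one byte per row
  explicit ByteMissingFlags(std::size_t n_rows) : flags(n_rows, 0) {}
  void Set(std::size_t rid) { flags[rid] = 1; }  // plain store, no RMW
  bool Get(std::size_t rid) const { return flags[rid] != 0; }
};
```

This is the memory/safety trade-off discussed in the review thread: the byte-per-row layout is trivially thread-safe but costs more memory, which is what later motivated returning to a carefully aligned bitfield.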