Describe the bug
At the save_to_disk step, max_shard_size defaults to 500MB. However, a single row of the dataset can be larger than 500MB, and then saving throws an IndexError. Without looking at the source code, my guess is that the number of shards is computed as total_size / min(max_shard_size, row_size), when it should be total_size / max(max_shard_size, row_size).
The fix is to set a larger max_shard_size.
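A minimal sketch of the suspected arithmetic (hypothetical numbers, not the actual datasets internals), showing how dividing by the smaller of the two sizes can yield more shards than rows and push an index past the end of the dataset:

# Hypothetical illustration of the suspected shard-count bug; the numbers
# and variable names below are not taken from the datasets source code.
total_size = 10 * 600 * 1024**2   # 10 rows of ~600MB each
row_size = 600 * 1024**2          # one row is larger than the default shard size
max_shard_size = 500 * 1024**2    # 500MB default

buggy_num_shards = total_size // min(max_shard_size, row_size)  # 12 shards for 10 rows
fixed_num_shards = total_size // max(max_shard_size, row_size)  # 10 shards for 10 rows
print(buggy_num_shards, fixed_num_shards)  # 12 10

With 12 shards planned for a 10-row dataset, the shard boundaries end up referencing row indices that do not exist, which matches the IndexError below.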
Steps to reproduce the bug
1. Create a dataset with large dense tensors per row.
2. Set a small max_shard_size, say 1MB.
3. Call save_to_disk (see the repro sketch after this list).
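A repro sketch following the steps above (the array shape, row count, and output path are arbitrary illustrative choices):

import numpy as np
from datasets import Dataset

# Each row carries a dense ~8MB tensor (1000x1000 float64), far above 1MB.
rows = {"tensor": [np.random.rand(1000, 1000).tolist() for _ in range(10)]}
ds = Dataset.from_dict(rows)

# With max_shard_size smaller than a single row, saving fails with
# IndexError: Index 10 out of range for dataset of size 10.
ds.save_to_disk("repro_dataset", max_shard_size="1MB")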
Expected behavior
raise IndexError(f"Index {index} out of range for dataset of size {size}.")
IndexError: Index 10 out of range for dataset of size 10.
I'm running into this problem while working on my translation of mteb/stackexchange-clustering. Each row holds a lot of samples (up to 100k), because in this dataset each row represents multiple clusters.
My workaround is to set max_shard_size to 20GB or even larger:
final_dataset.push_to_hub(
    output_dataset,
    private=True,
    max_shard_size="20GB",  # This will ensure appropriate sharding based on data size
)
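The same workaround should apply to save_to_disk, which accepts the same max_shard_size argument; the path below is an illustrative placeholder:

# Raise max_shard_size above the largest single row so the shard count
# never exceeds the number of rows.
final_dataset.save_to_disk("local_output_dir", max_shard_size="20GB")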
Environment info
- datasets version: 2.16.0
- huggingface_hub version: 0.20.2
- fsspec version: 2023.12.2