Index error when data is large #6602

Open

ChenchaoZhao opened this issue Jan 18, 2024 · 1 comment

Comments

@ChenchaoZhao

Describe the bug

At the save_to_disk step, max_shard_size defaults to 500MB. However, a single row of the dataset might be larger than 500MB, and then saving throws an IndexError. Without looking at the source code, I think the bug is a wrong calculation of the number of shards: something like total_size / min(max_shard_size, row_size), when it should be total_size / max(max_shard_size, row_size).
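
A rough sketch of the suspected miscalculation (illustrative numbers only, not the actual library code):

import math

row_size = 600_000_000            # one 600 MB row, larger than the default shard size
num_rows = 10
total_size = row_size * num_rows  # 6 GB dataset
max_shard_size = 500_000_000      # 500 MB default

# Suspected: sizing shards by the smaller limit over-counts shards
# when a single row exceeds max_shard_size.
num_shards_suspected = math.ceil(total_size / min(max_shard_size, row_size))  # 12

# Expected: a shard must hold at least one full row, so use the larger value.
num_shards_expected = math.ceil(total_size / max(max_shard_size, row_size))   # 10

print(num_shards_suspected, num_shards_expected)
# 12 shards for a 10-row dataset means some shard asks for a row index
# past the end of the dataset, hence the IndexError.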

The workaround is to set a larger max_shard_size.

Steps to reproduce the bug

  1. Create a dataset with large dense tensors per row.
  2. Set a small max_shard_size, say 1MB.
  3. Call save_to_disk (a minimal sketch follows this list).
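
A minimal reproduction sketch along those lines (the column name, tensor shape, and output path are illustrative):

import numpy as np
from datasets import Dataset

# Each row holds a ~4 MB dense tensor (1024 x 1024 float32),
# well above the 1 MB shard size requested below.
rows = [np.random.rand(1024, 1024).astype(np.float32) for _ in range(10)]
ds = Dataset.from_dict({"tensor": rows})

# Requesting shards smaller than a single row triggers the IndexError below
# (observed with datasets 2.16.0).
ds.save_to_disk("large_row_dataset", max_shard_size="1MB")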

Expected behavior

raise IndexError(f"Index {index} out of range for dataset of size {size}.")

IndexError: Index 10 out of range for dataset of size 10.

Environment info

  • datasets version: 2.16.0
  • Platform: Linux-5.10.201-168.748.amzn2int.x86_64-x86_64-with-glibc2.26
  • Python version: 3.10.13
  • huggingface_hub version: 0.20.2
  • PyArrow version: 14.0.2
  • Pandas version: 2.1.4
  • fsspec version: 2023.12.2
@BaoLocPham

I'm facing this problem while translating mteb/stackexchange-clustering. Each row has a lot of samples (up to 100k), because in this dataset each row represents multiple clusters.
My hack is to set max_shard_size to 20GB or even larger:

final_dataset.push_to_hub(
    output_dataset,
    private=True,
    max_shard_size="20GB",  # ensure appropriate sharding based on data size
)

It works, but the right value depends on your data size.
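
The same workaround applies to the save_to_disk path from the original report (directory name and size are illustrative):

# Pick a shard size comfortably larger than your largest row.
final_dataset.save_to_disk("output_dir", max_shard_size="20GB")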
