Index error when data is large #6602

Open

ChenchaoZhao opened this issue Jan 18, 2024 · 1 comment

Comments

@ChenchaoZhao

Describe the bug

At the save_to_disk step, max_shard_size defaults to 500MB. However, a single row of the dataset might be larger than 500MB, and then saving throws an IndexError. Without looking at the source code, I think the bug is a wrong calculation of the number of shards: something like total_size / min(max_shard_size, row_size), when it should be total_size / max(max_shard_size, row_size).
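
A rough sketch of the suspected miscalculation (illustrative numbers only, not the actual library code):

import math

row_size = 600_000_000            # one 600 MB row, larger than the default shard size
num_rows = 10
total_size = row_size * num_rows  # 6 GB dataset
max_shard_size = 500_000_000      # 500 MB default

# Suspected: sizing shards by the smaller limit over-counts shards
# when a single row exceeds max_shard_size.
num_shards_suspected = math.ceil(total_size / min(max_shard_size, row_size))  # 12

# Expected: a shard must hold at least one full row, so use the larger value.
num_shards_expected = math.ceil(total_size / max(max_shard_size, row_size))   # 10

print(num_shards_suspected, num_shards_expected)
# 12 shards for a 10-row dataset means some shard asks for a row index
# past the end of the dataset, hence the IndexError.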

The workaround is to set a larger max_shard_size.

Steps to reproduce the bug

  1. Create a dataset with large dense tensors per row.
  2. Set a small max_shard_size, say 1MB.
  3. Call save_to_disk (a minimal sketch follows this list).
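
A minimal reproduction sketch along those lines (the column name, tensor shape, and output path are illustrative):

import numpy as np
from datasets import Dataset

# Each row holds a ~4 MB dense tensor (1024 x 1024 float32),
# well above the 1 MB shard size requested below.
rows = [np.random.rand(1024, 1024).astype(np.float32) for _ in range(10)]
ds = Dataset.from_dict({"tensor": rows})

# Requesting shards smaller than a single row triggers the IndexError below
# (observed with datasets 2.16.0).
ds.save_to_disk("large_row_dataset", max_shard_size="1MB")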

Expected behavior

raise IndexError(f"Index {index} out of range for dataset of size {size}.")

IndexError: Index 10 out of range for dataset of size 10.

Environment info

  • datasets version: 2.16.0
  • Platform: Linux-5.10.201-168.748.amzn2int.x86_64-x86_64-with-glibc2.26
  • Python version: 3.10.13
  • huggingface_hub version: 0.20.2
  • PyArrow version: 14.0.2
  • Pandas version: 2.1.4
  • fsspec version: 2023.12.2
@BaoLocPham

I'm facing this problem while translating mteb/stackexchange-clustering. Each row has a lot of samples (up to 100k), because in this dataset each row represents multiple clusters.
My hack is to set max_shard_size to 20GB or even larger:

final_dataset.push_to_hub(
    output_dataset,
    private=True,
    max_shard_size="20GB",  # ensure appropriate sharding based on data size
)

It works, but the right value depends on your data size.
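
The same workaround applies to the save_to_disk path from the original report (directory name and size are illustrative):

# Pick a shard size comfortably larger than your largest row.
final_dataset.save_to_disk("output_dir", max_shard_size="20GB")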
