tests

Latest commit: Remove conditions for Python < 3.9 (#7474)

a325ae3 · Apr 15, 2025

Name	Last commit message	Last commit date
commands	Remove default `trust_remote_code=True` (#6954)	Jun 7, 2024
distributed_scripts	Drop Python 3.7 support (#6005)	Jul 6, 2023
features	Introduce pdf support (#7318) (#7325)	Mar 18, 2025
fixtures	Disable implicit token in CI (#7126)	Aug 26, 2024
io	Faster parquet streaming + filters with predicate pushdown (#7309)	Dec 7, 2024
packaged_modules	Faster folder based builder + parquet support + allow repeated media …	Mar 5, 2025
README.md	Add text dataset (#356)	Jul 10, 2020
__init__.py	add slow tests (#34)	May 3, 2020
_test_patching.py	Re-enable import sorting disabled by flake8:noqa directive when using…	Jun 4, 2024
conftest.py	Support async functions in map() (#7384)	Feb 13, 2025
test_arrow_dataset.py	Remove conditions for Python < 3.9 (#7474)	Apr 15, 2025
test_arrow_reader.py	Make split slicing consistent with list slicing (#5891)	Jan 31, 2024
test_arrow_writer.py	Revert the changes in `arrow_writer.py` from #6636 (#6664)	Feb 16, 2024
test_builder.py	Remove beam (#6987)	Jun 26, 2024
test_data_files.py	Faster folder based builder + parquet support + allow repeated media …	Mar 5, 2025
test_dataset_dict.py	Set load_from_disk path type as PathLike (#7081)	Jul 30, 2024
test_dataset_list.py	add Dataset.from_list (#4890)	Sep 2, 2022
test_distributed.py	Add IterableDataset.shard() (#7252)	Oct 25, 2024
test_download_manager.py	Use `huggingface_hub` cache (#7105)	Aug 21, 2024
test_exceptions.py	Remove deprecated code (#6996)	Aug 21, 2024
test_experimental.py	Add parallel module using joblib for Spark (#5924)	Jun 14, 2023
test_extract.py	Less zip false positives (#5640)	Mar 16, 2023
test_file_utils.py	use huggingface_hub offline mode (#7244)	Oct 22, 2024
test_filelock.py	Use `filelock` package for file locking (#6445)	Nov 23, 2023
test_filesystem.py	Remove deprecated code (#6996)	Aug 21, 2024
test_fingerprint.py	Unblock NumPy 2.0 (#6991)	Jul 12, 2024
test_formatting.py	Unblock NumPy 2.0 (#6991)	Jul 12, 2024
test_hub.py	update load_dataset doctring (#7301)	Nov 29, 2024
test_info.py	Rename LargeList.dtype to LargeList.feature (#7106)	Aug 26, 2024
test_info_utils.py	Upgrade black to version ~=22.0 (#3691)	Feb 8, 2022
test_inspect.py	Update docs (#7395)	Feb 13, 2025
test_iterable_dataset.py	Fix resuming after `ds.set_epoch(new_epoch)` (#7451)	Mar 14, 2025
test_load.py	Faster folder based builder + parquet support + allow repeated media …	Mar 5, 2025
test_metadata_util.py	Update docs (#7395)	Feb 13, 2025
test_offline_util.py	Bump hfh to 0.24 to fix ci (#7350)	Jan 3, 2025
test_parallel.py	Add parallel module using joblib for Spark (#5924)	Jun 14, 2023
test_patching.py	Format code with `ruff` (#5519)	Feb 14, 2023
test_py_utils.py	Refactor `string_to_dict` to return `None` if there is no match inste…	Mar 12, 2025
test_search.py	update load_dataset doctring (#7301)	Nov 29, 2024
test_sharding_utils.py	Multiprocessed dataset builder [WIP] (#5107)	Nov 9, 2022
test_splits.py	Fix doc generation when NamedSplit is used as parameter default value (…	Jul 26, 2024
test_streaming_download_manager.py	Extract data on the fly in packaged builders (#6784)	Apr 16, 2024
test_table.py	Support JSON lines with missing columns (#7170)	Sep 26, 2024
test_tqdm.py	Better `tqdm` wrapper (#6433)	Nov 22, 2023
test_upstream_hub.py	update load_dataset doctring (#7301)	Nov 29, 2024
test_version.py	Make `Version` hashable (#5238)	Nov 14, 2022
utils.py	Introduce pdf support (#7318) (#7325)	Mar 18, 2025

README.md

Add Dummy data test

Important: To pass the load_dataset_<dataset_name> test, dummy data is required for every possible config name.

First we distinguish between dataset scripts that

A) have no config class and
B) have a config class

For A) the dummy data folder structure will always look as follows:

dummy/<version>/dummy_data.zip, e.g. cosmos_qa/dummy/0.1.0/dummy_data.zip.

For B) the dummy data folder structure will always look as follows:

dummy/<config_name>/<version>/dummy_data.zip, e.g. squad/dummy/plain-text/1.0.0/dummy_data.zip.
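The two layouts above can be sketched as a small path helper. This is only an illustration: the function name dummy_zip_path and its parameters are hypothetical and are not part of the datasets test suite.

```python
import os


def dummy_zip_path(dataset_dir, version, config_name=None):
    """Build the expected location of dummy_data.zip (hypothetical helper).

    Case A (no config class): <dataset_dir>/dummy/<version>/dummy_data.zip
    Case B (config class):    <dataset_dir>/dummy/<config_name>/<version>/dummy_data.zip
    """
    parts = [dataset_dir, "dummy"]
    if config_name is not None:
        # Case B: one sub-folder per config name
        parts.append(config_name)
    parts.extend([version, "dummy_data.zip"])
    return os.path.join(*parts)


# The two examples from the text
print(dummy_zip_path("cosmos_qa", "0.1.0"))
print(dummy_zip_path("squad", "1.0.0", config_name="plain-text"))
```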

Now the difficult part is to create the correct dummy_data.zip file.

Important: When checking the dummy folder structure of already added datasets, always unzip dummy_data.zip. If a folder named dummy_data is found next to dummy_data.zip, it is probably an old version and should be deleted. The tests only take the dummy_data.zip file into account.
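A quick way to follow this advice is to list the archive contents and check for a leftover folder. The sketch below uses only the Python standard library; the helper name inspect_dummy_dir is hypothetical and the directory you pass in is whichever dummy/<version> (or dummy/<config_name>/<version>) folder you are checking.

```python
import os
import zipfile


def inspect_dummy_dir(dummy_version_dir):
    """Return (entries inside dummy_data.zip, whether a stale dummy_data folder exists)."""
    zip_path = os.path.join(dummy_version_dir, "dummy_data.zip")
    with zipfile.ZipFile(zip_path) as zf:
        # Archive member names always use forward slashes
        names = zf.namelist()
    # An unzipped dummy_data folder next to the zip is ignored by the tests
    has_stale_folder = os.path.isdir(os.path.join(dummy_version_dir, "dummy_data"))
    return names, has_stale_folder
```

If has_stale_folder is true, the folder is probably a leftover and can be deleted, since only dummy_data.zip is read.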

Here we have to pay close attention to the _split_generators(self, dl_manager) function of the dataset script in question. There are three general possibilities:

1) The dl_manager.download_and_extract() is given a single path variable of type str as its argument. In this case the file dummy_data.zip should unzip to the following structure: os.path.join("dummy_data", <additional-paths-as-defined-in-split-generations>), e.g. for sentiment140, the unzipped dummy_data.zip has the following dir structure: dummy_data/testdata.manual.2009.06.14.csv and dummy_data/training.1600000.processed.noemoticon.csv.

Note: if there are no <additional-paths-as-defined-in-split-generations>, then dummy_data should be the name of the single file. An example of this is the crime-and-punishment dataset script.

2) The dl_manager.download_and_extract() is given a dictionary of paths of type str as its argument. In this case the file dummy_data.zip should unzip to the following structure: os.path.join("dummy_data", <value_of_dict>.split('/')[-1], <additional-paths-as-defined-in-split-generations>), e.g. for squad, the unzipped dummy_data.zip has the following dir structure: dummy_data/dev-v1.1.json, etc.

Note: if <value_of_dict> is a zipped file, then the dummy data folder structure should contain the exact name of the zipped file and the following extracted folder structure. The file dummy_data.zip should never itself contain a zipped file, since the dummy data is not unzipped by the MockDownloadManager during testing. E.g. check the dummy folder structure of hansards, where the folders have to be named *.tar, or the structure of wiki_split, where the folders have to be named *.zip.

3) The dl_manager.download_and_extract() is given a dictionary of lists of paths of type str as its argument. This is a very special case and has been seen only for the dataset esnli. In this case the values are simply flattened and the dummy folder structure is the same as in 2).
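The three cases can be summarised in a short sketch that maps the argument of dl_manager.download_and_extract() to the top-level entries expected inside the unzipped dummy_data.zip. The function name expected_dummy_paths and the example URLs are hypothetical, and this is one reading of the rules above, not the logic used by MockDownloadManager.

```python
import os


def expected_dummy_paths(urls):
    """Map a download_and_extract() argument to expected dummy data entries (sketch)."""

    def entry(url):
        # Cases 2 and 3: keep only the last URL segment under dummy_data
        return os.path.join("dummy_data", url.split("/")[-1])

    if isinstance(urls, str):
        # Case 1: a single path; everything lives directly under dummy_data
        return "dummy_data"
    if isinstance(urls, dict):
        return {
            # Case 3 (values are lists) is treated like case 2, item by item
            key: [entry(u) for u in value] if isinstance(value, (list, tuple)) else entry(value)
            for key, value in urls.items()
        }
    raise TypeError("expected a str or a dict of str / lists of str")


# Hypothetical URLs, for illustration only
print(expected_dummy_paths({"dev": "https://example.com/data/dev-v1.1.json"}))
print(expected_dummy_paths({"train": ["https://example.com/a.csv", "https://example.com/b.csv"]}))
```

Any additional paths used in the split generators still have to be appended below these entries by hand.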
