Add Dummy data test

Important: To pass the load_dataset_<dataset_name> test, dummy data is required for every possible config name.

First, we distinguish between dataset scripts that

  • A) have no config class and
  • B) have a config class

For A), the dummy data folder structure will always look as follows:

  • dummy/<version>/dummy_data.zip, e.g. cosmos_qa/dummy/0.1.0/dummy_data.zip.

For B), the dummy data folder structure will always look as follows:

  • dummy/<config_name>/<version>/dummy_data.zip, e.g. squad/dummy/plain-text/1.0.0/dummy_data.zip.
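The two layouts can be sketched as a small path builder. This is only an illustration: dummy_zip_path is a hypothetical helper, not part of the library, and it uses posixpath so that the output always has forward slashes.

```python
import posixpath


def dummy_zip_path(dataset_name, version, config_name=None):
    """Return the expected location of dummy_data.zip (hypothetical helper)."""
    parts = [dataset_name, "dummy"]
    if config_name is not None:
        # Case B: scripts with a config class add one folder per config name.
        parts.append(config_name)
    parts += [version, "dummy_data.zip"]
    return posixpath.join(*parts)


print(dummy_zip_path("cosmos_qa", "0.1.0"))
print(dummy_zip_path("squad", "1.0.0", config_name="plain-text"))
```

A script with several configs would need one such path, and one dummy_data.zip, per config name.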

Now the difficult part is to create the correct dummy_data.zip file.

Important: When checking the dummy folder structure of already added datasets, always unzip dummy_data.zip. If a folder named dummy_data is found next to dummy_data.zip, it is probably an old version and should be deleted. The tests only take the dummy_data.zip file into account.
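One way to inspect an archive without relying on a possibly stale dummy_data folder is to read the member names straight from the zip. The sketch below uses only the standard library; list_dummy_zip is a hypothetical helper, and the demo archive and its contents are made up so that the snippet runs on its own.

```python
import os
import tempfile
import zipfile


def list_dummy_zip(zip_path):
    """Return the sorted member names stored in a dummy_data.zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())


# Demo on a throwaway archive so the snippet is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_zip = os.path.join(tmp_dir, "dummy_data.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("dummy_data/train.csv", "placeholder row\n")
    demo_members = list_dummy_zip(demo_zip)
    print(demo_members)
```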

Here we have to pay close attention to the _split_generators(self, dl_manager) function of the dataset script in question. There are three general possibilities:

  1. dl_manager.download_and_extract() is given a single path variable of type str as its argument. In this case, dummy_data.zip should unzip to the following structure: os.path.join("dummy_data", <additional-paths-as-defined-in-split-generations>). For example, for sentiment140, the unzipped dummy_data.zip contains dummy_data/testdata.manual.2009.06.14.csv and dummy_data/training.1600000.processed.noemoticon.csv.

Note: If there are no <additional-paths-as-defined-in-split-generations>, then dummy_data should be the name of the single file. An example of this is the crime-and-punishment dataset script.
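As a rough illustration of the single-path case, the following sketch builds such an archive with the standard zipfile module. build_dummy_zip is a hypothetical helper and the file contents are placeholders. Real dummy data would hold a few rows in the format the dataset script expects.

```python
import os
import posixpath
import tempfile
import zipfile


def build_dummy_zip(zip_path, files):
    """Write `files` (relative path -> text) under a top-level dummy_data entry.

    An empty relative path means there are no additional paths, so the single
    file itself is stored under the name "dummy_data".
    """
    with zipfile.ZipFile(zip_path, "w") as archive:
        for rel_path, text in files.items():
            arcname = posixpath.join("dummy_data", rel_path) if rel_path else "dummy_data"
            archive.writestr(arcname, text)


with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "dummy_data.zip")
    build_dummy_zip(zip_path, {
        "testdata.manual.2009.06.14.csv": "placeholder test row\n",
        "training.1600000.processed.noemoticon.csv": "placeholder training row\n",
    })
    with zipfile.ZipFile(zip_path) as archive:
        case_one_members = sorted(archive.namelist())
    print(case_one_members)
```

Passing {"": "some text"} instead would store a single member named dummy_data, which matches the note about scripts without additional paths.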

  2. dl_manager.download_and_extract() is given a dictionary of paths of type str as its argument. In this case, dummy_data.zip should unzip to the following structure: os.path.join("dummy_data", <value_of_dict>.split('/')[-1], <additional-paths-as-defined-in-split-generations>). For example, for squad, the unzipped dummy_data.zip contains dummy_data/dev-v1.1.json, etc.

Note: If <value_of_dict> is a zipped file, then the dummy data folder structure should contain the exact name of the zipped file, followed by the extracted folder structure. The file dummy_data.zip should never itself contain a zipped file, since the dummy data is not unzipped by the MockDownloadManager during testing. For example, check the dummy folder structure of hansards, where the folders have to be named *.tar, or the structure of wiki_split, where the folders have to be named *.zip.
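The naming rule for the dictionary case can be sketched as follows. expected_dummy_paths is a hypothetical helper and the URLs are placeholders on example.com. Only the last path segment of each value matters for the layout.

```python
import posixpath


def expected_dummy_paths(urls):
    """Map each dict key to the path expected inside the unzipped dummy_data.zip."""
    return {
        key: posixpath.join("dummy_data", url.split("/")[-1])
        for key, url in urls.items()
    }


# Placeholder URLs, not the real download locations of any dataset.
urls = {
    "dev": "https://example.com/data/dev-v1.1.json",
    "archive": "https://example.com/data/corpus.zip",
}
dummy_paths = expected_dummy_paths(urls)
print(dummy_paths["dev"])      # a regular file
print(dummy_paths["archive"])  # must be a folder named corpus.zip, not a zip file
```

For the "archive" entry, the files that the real archive would extract to go inside the dummy_data/corpus.zip folder.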

  3. dl_manager.download_and_extract() is given a dictionary of lists of paths of type str as its argument. This is a very special case and has been seen only for the dataset esnli. In this case, the values are simply flattened and the dummy folder structure is the same as in 2).
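A minimal sketch of the flattening step, assuming it amounts to concatenating the lists in dictionary order. flatten_url_lists is a hypothetical helper and the URLs are placeholders.

```python
import posixpath


def flatten_url_lists(url_lists):
    """Flatten a dict of URL lists into one list, keeping insertion order."""
    return [url for urls in url_lists.values() for url in urls]


# Placeholder URLs, not the real download locations of any dataset.
url_lists = {
    "train": ["https://example.com/train_1.csv", "https://example.com/train_2.csv"],
    "test": ["https://example.com/test.csv"],
}
flat_dummy_paths = [
    posixpath.join("dummy_data", url.split("/")[-1])
    for url in flatten_url_lists(url_lists)
]
print(flat_dummy_paths)
```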