This repository was archived by the owner on Feb 12, 2025. It is now read-only.

Ib/new datasets #2

Merged
merged 7 commits on Jan 5, 2024
95 changes: 3 additions & 92 deletions README.md
@@ -1,94 +1,5 @@
# DPR Dataset Generator
# Vector Search Datasets

This repository provides code to generate base and query vector datasets for similarity search benchmarking and evaluation on high-dimensional vectors stemming from large language models.
With the dense passage retriever (DPR) [[1]](#1), we encode text snippets from the C4 dataset [[2]](#2) to generate 768-dimensional vectors:
- context DPR embeddings for the base set and
- question DPR embeddings for the query set.
This repository provides code to generate several datasets for similarity search benchmarking and evaluation on high-dimensional vectors stemming from recent deep learning models.
Please see the details of each dataset in the respective README files.

The metric for similarity search is inner product [[1]](#1).

The number of base and query embedding vectors is configurable.

## DPR10M

A specific instance with 10 million base vectors and 10,000 query vectors is introduced in [[3]](#3). Use the script [dpr_dataset_10M.py](dpr_dataset_10M.py) to generate this dataset. The corresponding ground-truth (available [here](gtruth_dpr10M_innerProduct.ivecs)) is generated by conducting an exhaustive search with the inner product
metric.

Here is a summary of the **steps to generate this dataset**:

1. **Download the files** corresponding to the `en` variant of the C4 dataset available [here](https://huggingface.co/datasets/allenai/c4).
The complete set of files requires 350GB of storage, so you might want to follow the instructions to download only a subset. For example, to generate 10M embeddings,
we used the first two files from the train set (i.e., files `c4-train.00000-of-01024.json.gz` and `c4-train.00001-of-01024.json.gz` in `c4/en/train`).

2. **Execute** the `generate_dpr_embeddings` function to generate a `.fvecs` file containing the new embeddings.
Note that different settings should be used to generate the **base vectors** and the **query set**, as they use the
DPR context and question encoders, respectively.
See the script [dpr_dataset_10M.py](dpr_dataset_10M.py) for details.

```
# Example code to generate base vectors

from dpr_dataset_generate import generate_dpr_embeddings

base_C4_folder = '/home/username/research/datasets/c4/en'  # Set this to where your c4/en folder is located
cache_folder = '/home/username/.cache/huggingface/datasets/'  # Set to the Hugging Face datasets cache path
dataset_dir = f'{base_C4_folder}/train/'

num_embd = 10000000
init_file = 0
num_of_files = 2  # Make sure the input files (2 in this case) are enough to generate the
                  # requested number of embeddings. To get an estimate, use the optional
                  # parameter get_total_embeddings_only to get the number of embeddings
                  # that can be generated from a certain group of files without actually
                  # generating the embeddings.
fname_prefix_out = 'c4-en'
doc_stride = 32
max_length = 64
batch_size = 512
dim = 768

generate_dpr_embeddings(init_file, num_of_files, num_embd, doc_stride, max_length, dim,
                        batch_size,
                        dataset_dir, fname_prefix_out, cache_folder)
```
3. **Generate the ground-truth** by conducting an exhaustive search with the inner product metric.
We provide the [ground-truth](gtruth_dpr10M_innerProduct.ivecs) for the dataset generated using
[dpr_dataset_10M.py](dpr_dataset_10M.py).

> **_NOTE:_** Due to floating-point arithmetic precision, the vector embeddings generated by running the provided
> code on different machines may vary slightly. Keep in mind that this could cause small discrepancies with the provided ground-truth.

4. Functions `read_fvecs` and `read_ivecs` can be used to read `.fvecs` and `.ivecs` files, respectively.

## References
Please cite the following reference when using datasets generated with this code in a research paper:

```
@article{aguerrebere2023similarity,
  title = {Similarity search in the blink of an eye with compressed indices},
  volume = {16},
  number = {11},
  pages = {3433--3446},
  journal = {Proceedings of the VLDB Endowment},
  author = {Cecilia Aguerrebere and Ishwar Bhati and Mark Hildebrand and Mariano Tepper and Ted Willke},
  year = {2023}
}
```

<a id="1">[1]</a>
Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W..: Dense Passage
Retrieval for Open-Domain Question Answering. In: Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP). 6769–6781. (2020)

<a id="2">[2]</a>
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu,
P.J.: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.
In: The Journal of Machine Learning Research 21,140:1–140:67.(2020)

<a id="3">[3]</a>
Aguerrebere, C.; Bhati I.; Hildebrand M.; Tepper M.; Willke T.:Similarity search in the blink of an eye with compressed
indices. In: Proceedings of the VLDB Endowment, 16, 11, 3433 - 3446. (2023)

This "research quality code" is for Non-Commercial purposes provided by Intel "As Is" without any express or implied
warranty of any kind. Please see the dataset's applicable license for terms and conditions. Intel does not own the
rights to this data set and does not confer any rights to it. Intel does not warrant or assume responsibility for the accuracy or completeness of any information, text, graphics, links or other items within the code. A thorough security review has not been performed on this code. Additionally, this repository may contain components that are out of date or contain known security vulnerabilities.
94 changes: 94 additions & 0 deletions dpr/README.md
@@ -0,0 +1,94 @@
# DPR Dataset Generator

This repository provides code to generate base and query vector datasets for similarity search benchmarking and evaluation on high-dimensional vectors stemming from large language models.
With the dense passage retriever (DPR) [[1]](#1), we encode text snippets from the C4 dataset [[2]](#2) to generate 768-dimensional vectors:
- context DPR embeddings for the base set and
- question DPR embeddings for the query set.

The metric for similarity search is inner product [[1]](#1).

The number of base and query embedding vectors is configurable.

## DPR10M

A specific instance with 10 million base vectors and 10,000 query vectors is introduced in [[3]](#3). Use the script [dpr_dataset_10M.py](dpr_dataset_10M.py) to generate this dataset. The corresponding ground-truth (available [here](gtruth_dpr10M_innerProduct.ivecs)) is generated by conducting an exhaustive search with the inner product
metric.

Here is a summary of the **steps to generate this dataset**:

1. **Download the files** corresponding to the `en` variant of the C4 dataset available [here](https://huggingface.co/datasets/allenai/c4).
The complete set of files requires 350GB of storage, so you might want to follow the instructions to download only a subset. For example, to generate 10M embeddings,
we used the first two files from the train set (i.e., files `c4-train.00000-of-01024.json.gz` and `c4-train.00001-of-01024.json.gz` in `c4/en/train`); one way to fetch just those files is sketched below.
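As one way to fetch just those two files, a sketch using `huggingface_hub` (this helper is our assumption; the dataset card's own download instructions are authoritative):

```
from huggingface_hub import hf_hub_download

# Download only the first two train shards of the en variant instead of all 350GB.
for i in range(2):
    hf_hub_download(repo_id='allenai/c4', repo_type='dataset',
                    filename=f'en/c4-train.{i:05d}-of-01024.json.gz',
                    local_dir='/home/username/research/datasets/c4')
# Afterwards, arrange the shards under c4/en/train/ to match the layout used below.
```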

2. **Execute** the `generate_dpr_embeddings` function to generate a `.fvecs` file containing the new embeddings.
Note that different settings should be used to generate the **base vectors** and the **query set**, as they use the
DPR context and question encoders, respectively.
See the script [dpr_dataset_10M.py](dpr_dataset_10M.py) for details.

```
# Example code to generate base vectors

from dpr_dataset_generate import generate_dpr_embeddings

base_C4_folder = '/home/username/research/datasets/c4/en'  # Set this to where your c4/en folder is located
cache_folder = '/home/username/.cache/huggingface/datasets/'  # Set to the Hugging Face datasets cache path
dataset_dir = f'{base_C4_folder}/train/'

num_embd = 10000000
init_file = 0
num_of_files = 2  # Make sure the input files (2 in this case) are enough to generate the
                  # requested number of embeddings. To get an estimate, use the optional
                  # parameter get_total_embeddings_only to get the number of embeddings
                  # that can be generated from a certain group of files without actually
                  # generating the embeddings.
fname_prefix_out = 'c4-en'
doc_stride = 32
max_length = 64
batch_size = 512
dim = 768

generate_dpr_embeddings(init_file, num_of_files, num_embd, doc_stride, max_length, dim,
                        batch_size,
                        dataset_dir, fname_prefix_out, cache_folder)
```
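The comment above mentions an optional `get_total_embeddings_only` parameter for estimating yields before committing to a full run. A hedged sketch of that dry run (that it returns the count rather than writing output is our assumption; check the function's docstring for the authoritative behavior):

```
# Dry run: estimate how many embeddings the selected files can yield,
# assuming the flag makes generate_dpr_embeddings return the count
# instead of producing a .fvecs file.
total = generate_dpr_embeddings(init_file, num_of_files, num_embd, doc_stride,
                                max_length, dim, batch_size,
                                dataset_dir, fname_prefix_out, cache_folder,
                                get_total_embeddings_only=True)
print(f'{total} embeddings available from {num_of_files} file(s)')
```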
3. **Generate the ground-truth** by conducting an exhaustive search with the inner product metric.
We provide the [ground-truth](gtruth_dpr10M_innerProduct.ivecs) for the dataset generated using
[dpr_dataset_10M.py](dpr_dataset_10M.py).

> **_NOTE:_** Due to floating-point arithmetic precision, the vector embeddings generated by running the provided
> code on different machines may vary slightly. Keep in mind that this could cause small discrepancies with the provided ground-truth.
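For reference, a minimal sketch of the exhaustive search in step 3, assuming the base and query vectors are already loaded as NumPy arrays (the function and variable names are illustrative, not from this repo):

```
import numpy as np

def exhaustive_ground_truth(base, queries, k=100, batch=64):
    # Brute-force top-k under inner product (higher score = closer match).
    # Queries are processed in batches to bound the size of the score matrix.
    gt = np.empty((len(queries), k), dtype=np.int32)
    for i in range(0, len(queries), batch):
        scores = queries[i:i + batch] @ base.T           # (batch, num_base)
        topk = np.argpartition(-scores, k, axis=1)[:, :k]
        # argpartition leaves the top-k block unsorted; order it by score.
        order = np.argsort(-np.take_along_axis(scores, topk, axis=1), axis=1)
        gt[i:i + len(topk)] = np.take_along_axis(topk, order, axis=1)
    return gt
```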

4. Functions `read_fvecs` and `read_ivecs` can be used to read `.fvecs` and `.ivecs` files, respectively.
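The repository provides these readers; for reference, a minimal equivalent assuming the standard TEXMEX layout, where every vector is stored as an int32 component count followed by that many float32 (`.fvecs`) or int32 (`.ivecs`) values:

```
import numpy as np

def read_fvecs(fname):
    raw = np.fromfile(fname, dtype=np.int32)
    d = raw[0]  # per-vector dimensionality, repeated before every vector
    return raw.reshape(-1, d + 1)[:, 1:].copy().view(np.float32)

def read_ivecs(fname):
    raw = np.fromfile(fname, dtype=np.int32)
    d = raw[0]
    return raw.reshape(-1, d + 1)[:, 1:].copy()
```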

## References
Please cite the following reference when using datasets generated with this code in a research paper:

```
@article{aguerrebere2023similarity,
  title = {Similarity search in the blink of an eye with compressed indices},
  volume = {16},
  number = {11},
  pages = {3433--3446},
  journal = {Proceedings of the VLDB Endowment},
  author = {Cecilia Aguerrebere and Ishwar Bhati and Mark Hildebrand and Mariano Tepper and Ted Willke},
  year = {2023}
}
```

<a id="1">[1]</a>
Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W..: Dense Passage
Retrieval for Open-Domain Question Answering. In: Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP). 6769–6781. (2020)

<a id="2">[2]</a>
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu,
P.J.: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.
In: The Journal of Machine Learning Research 21,140:1–140:67.(2020)

<a id="3">[3]</a>
Aguerrebere, C.; Bhati I.; Hildebrand M.; Tepper M.; Willke T.:Similarity search in the blink of an eye with compressed
indices. In: Proceedings of the VLDB Endowment, 16, 11, 3433 - 3446. (2023)

This "research quality code" is for Non-Commercial purposes provided by Intel "As Is" without any express or implied
warranty of any kind. Please see the dataset's applicable license for terms and conditions. Intel does not own the
rights to this data set and does not confer any rights to it. Intel does not warrant or assume responsibility for the accuracy or completeness of any information, text, graphics, links or other items within the code. A thorough security review has not been performed on this code. Additionally, this repository may contain components that are out of date or contain known security vulnerabilities.
File renamed without changes.
File renamed without changes.
File renamed without changes.
89 changes: 89 additions & 0 deletions wit/README.md
@@ -0,0 +1,89 @@
# wit-512-1M Dataset Generator

This repository provides code to generate base and query (test and learn sets) embeddings for similarity search benchmarking
and evaluation on high-dimensional vectors. The dataset is designed to benchmark similarity search methods under
scenarios with out-of-distribution (OOD) queries stemming from a text-to-image application [[1]](#1).

The WIT dataset [[2]](#2) is a multimodal, multilingual dataset that contains 37 million rich image-text examples
extracted from Wikipedia pages. For each example in the first million training images
(downloaded from [here](https://storage.cloud.google.com/wikimedia-image-caption-public/image_data_train.tar)), we take the image and encode it
using the multimodal OpenAI CLIP-ViT-B32 model [[3]](#3) to generate a database vector.
We create the query set using the first 20K text descriptions in one of the provided test sets (concatenating
the Reference and Attribution description fields) and generate the corresponding embeddings using CLIP-ViT-B32-multilingual-v1 [[4]](#4).
The use of CLIP-ViT-B32 for images and CLIP-ViT-B32-multilingual-v1 for text follows the protocol suggested
[here](https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1).
Finally, for each query, we compute the 100 ground-truth nearest neighbors using maximum inner product.
We use the first 10K queries as the query test set and the remaining 10K as a learn set.
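Following that protocol, a minimal sketch of how the two encoders pair up in `sentence-transformers` (the image file and caption below are placeholders):

```
from PIL import Image
from sentence_transformers import SentenceTransformer

# Base vectors: images through CLIP-ViT-B32.
img_model = SentenceTransformer('clip-ViT-B-32')
# Query vectors: text through the multilingual text encoder aligned to it.
txt_model = SentenceTransformer('sentence-transformers/clip-ViT-B-32-multilingual-v1')

img_emb = img_model.encode([Image.open('example.jpg')])  # shape (1, 512)
txt_emb = txt_model.encode(['A photo of a cat'])         # shape (1, 512)
score = img_emb @ txt_emb.T  # inner-product similarity
```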

The metric for similarity search used with this dataset is inner product.


## Steps to generate the dataset

The script [wit_dataset_1M.py](wit_dataset_1M.py) generates 1 million base vectors from the provided training images
and two query sets, a test set and a learn set, each with 10K vectors generated from the text descriptions of the provided test set.

Here is a summary of the **steps to generate this dataset**:

1. **Download the WIT training images and test set**.
We download the training images from [here](https://storage.cloud.google.com/wikimedia-image-caption-public/image_data_train.tar) and extract them to the desired location:
> **_NOTE:_** The above link requires Google login authentication to download the training images.

```
tar -xvf image_data_train.tar -C $BASE_PATH
```
The extracted files will be in `$BASE_PATH/image_data_train/image_pixels/`.
The images are encoded in base64 format; see the image/file format details [here](https://www.kaggle.com/c/wikipedia-image-caption/data) and the decoding sketch below.
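As an illustration of that format, a decoding sketch assuming each line of a part file is tab-separated with the base64 image bytes in the second field (the part-file name and field index are assumptions; the Kaggle page above is authoritative):

```
import base64, csv, gzip, io
from PIL import Image

csv.field_size_limit(1 << 24)  # base64-encoded images overflow the default field limit

# Hypothetical part-file name; list image_data_train/image_pixels/ for the real ones.
with gzip.open('image_data_train/image_pixels/part-00000.csv.gz', 'rt') as f:
    for row in csv.reader(f, delimiter='\t'):
        image = Image.open(io.BytesIO(base64.b64decode(row[1])))
        print(image.size)
        break  # decode only the first image as a smoke test
```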

For queries, we download one of the provided [test sets](https://storage.googleapis.com/gresearch/wit/wit_v1.test.all-00000-of-00005.tsv.gz) and extract it:
```
mkdir -p $BASE_PATH/test_set
# The file is plain gzip, not a tar archive
gunzip -c wit_v1.test.all-00000-of-00005.tsv.gz > $BASE_PATH/test_set/wit_v1.test.all-00000-of-00005.tsv
```

2. **Run** the `wit_dataset_1M.py` script to generate `.fvecs` files containing the base,
query, and learn set vectors. **Remember to set the paths** to the folders where the
downloaded training images and test files are located.

3. **Generate the ground-truth** by conducting an exhaustive search with the inner product metric.
We provide the ground-truth files `wit_test_gt_1M_innerproduct.ivecs` and `wit_learn_gt_1M_innerproduct.ivecs` for the test and learn sets, respectively.

4. Functions `read_fvecs` and `read_ivecs` can be used to read `.fvecs` and `.ivecs` files, respectively.

> **_NOTE:_** Due to floating-point arithmetic precision, the vector embeddings generated by running the provided
> code on different machines may vary slightly. Keep in mind that this could cause small discrepancies with the provided ground-truth.
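Since the ground-truth files exist to score search methods, here is an illustrative recall helper one might use against them (not part of this repo):

```
import numpy as np

def recall_at_k(retrieved, gt, k=10):
    # Fraction of the true top-k neighbors (from the .ivecs ground truth)
    # recovered among the candidates returned by the method under test.
    hits = sum(np.intersect1d(retrieved[i, :k], gt[i, :k]).size
               for i in range(len(gt)))
    return hits / (len(gt) * k)
```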


## References
Please cite the following reference when using datasets generated with this code in a research paper:

```
@article{tepper2023leanvec,
  title = {LeanVec: Search your vectors faster by making them fit},
  author = {Mariano Tepper and Ishwar Singh Bhati and Cecilia Aguerrebere and Mark Hildebrand and Ted Willke},
  year = {2023},
  journal = {arXiv},
  doi = {https://doi.org/10.48550/arXiv.2312.16335}
}
```
<a id="1">[1]</a>
Mariano Tepper, Ishwar Singh Bhati, Cecilia Aguerrebere, Mark Hildebrand, and Ted Willke:
LeanVec: Search your vectors faster by making them fit. (2023)

<a id="2">[2]</a>
Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork:
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning. (2021)

<a id="3">[3]</a>
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever:
Learning Transferable Visual Models From Natural Language Supervision. (2021)

<a id="4">[4]</a>
Nils Reimers, and Iryna Gurevych:
Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation. (2020)

This "research quality code" is for Non-Commercial purposes provided by Intel "As Is" without any express or implied
warranty of any kind. Please see the dataset's applicable license for terms and conditions. Intel does not own the
rights to this data set and does not confer any rights to it. Intel does not warrant or assume responsibility for the accuracy or completeness of any information, text, graphics, links or other items within the code. A thorough security review has not been performed on this code. Additionally, this repository may contain components that are out of date or contain known security vulnerabilities.

5 changes: 5 additions & 0 deletions wit/requirements_wit.txt
@@ -0,0 +1,5 @@
sentence-transformers==2.2.2
Pillow==9.4.0
numpy==1.26.2
typing_extensions==4.5.0
natsort==8.4.0
50 changes: 50 additions & 0 deletions wit/wit_dataset_1M.py
@@ -0,0 +1,50 @@
from wit_dataset_generate import generate_image_embeddings, generate_text_embeddings


if __name__ == "__main__":
    # This is an example script to generate the wit-512-1M dataset, containing:
    #
    # --- 1M base vectors from image embeddings
    # --- out-of-distribution (OOD) query and learn sets (10k vectors each) from text embeddings
    #
    # Introduced in the paper "LeanVec: Search your vectors faster by making them fit", 2023,
    # Tepper, Bhati, Aguerrebere, Hildebrand, Willke (https://arxiv.org/abs/2312.16335)
    #
    # This dataset is created using a subset of Google's multimodal multilingual WIT dataset,
    # using image-text examples extracted from Wikipedia pages (https://github.com/google-research-datasets/wit).
    # To generate a base vector, we take the image and encode it using the OpenAI CLIP-ViT-B32 model.
    # For queries, we use text descriptions in one of the provided test sets
    # (concatenating the Reference and Attribution description fields) and generate the corresponding
    # embeddings using CLIP-ViT-B32-multilingual-v1. We followed the steps suggested in
    # https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1
    #
    # See the README for more details.
    #
    # Please see the documentation of the generate_image_embeddings and generate_text_embeddings
    # functions for details on the required parameters.

    base_path = '/raid0/ishwarsi/datasets/wit'  # Path to where the WIT images and test set are located
    images_dir = f'{base_path}/image_data_train/image_pixels'  # Directory with the image *.csv.gz files
    output_dir = f'{base_path}/output'  # Directory where the created dataset will be saved

    # Files are saved as [output_dir]/embeddings/{fname_prefix}.fvecs

    # Generate image embeddings from the images provided in images_dir
    fname_prefix = 'wit_base_1M'
    num_vecs = 1_000_000
    generate_image_embeddings(images_dir, num_vecs, output_dir, fname_prefix)

    test_file = f'{base_path}/test_set/wit_v1.test.all-00000-of-00005.tsv'

    # Generate text embeddings (query test set) from the test file
    fname_prefix = 'wit_query_10k'
    num_vecs = 10_000
    generate_text_embeddings(test_file, num_vecs, output_dir, fname_prefix)

    # Learn queries start from an offset (the last parameter) so they do not overlap the test queries
    fname_prefix = 'wit_learn_query_10k'
    num_vecs = 10_000
    num_skip = 10_000
    generate_text_embeddings(test_file, num_vecs, output_dir, fname_prefix, num_skip)