# Vector Search Datasets

This repository provides code to generate several datasets for similarity search benchmarking and evaluation on high-dimensional vectors stemming from recent deep learning models.

With the dense passage retriever (DPR) [[1]](#1), we encode text snippets from the C4 dataset [[2]](#2) to generate 768-dimensional vectors:

- context DPR embeddings for the base set and
- question DPR embeddings for the query set.

The metric for similarity search is inner product [[1]](#1).

The number of base and query embedding vectors is parametrizable.

## DPR10M

A specific instance with 10 million base vectors and 10,000 query vectors is introduced in [[3]](#3). Use the script [dpr_dataset_10M.py](dpr_dataset_10M.py) to generate this dataset. The corresponding ground truth (available [here](gtruth_dpr10M_innerProduct.ivecs)) is generated by conducting an exhaustive search with the inner product metric.

Here is a summary of the **steps to generate this dataset**:

1. **Download the files** corresponding to the `en` variant of the C4 dataset, accessible [here](https://huggingface.co/datasets/allenai/c4).
   The complete set of files requires 350GB of storage, so you might want to follow the instructions to download only a subset. For example, to generate 10M embeddings
   we used the first two files from the train set (i.e., files `c4-train.00000-of-01024.json.gz` and `c4-train.00001-of-01024.json.gz` in `c4/en/train`).

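The shard names follow a fixed pattern, so the files to fetch can be listed programmatically. A minimal sketch (the `shard_urls` helper and the `resolve/main` URL layout are illustrative assumptions, not part of this repository; pass the printed URLs to `wget` or the `huggingface_hub` client):

```python
# Build download URLs for the first few C4 `en` train shards; two shards are
# enough for ~10M embeddings, avoiding the full 350GB download.
BASE_URL = "https://huggingface.co/datasets/allenai/c4/resolve/main/en"

def shard_urls(first, count, total=1024):
    # Shards are named c4-train.XXXXX-of-01024.json.gz with zero-padded indices.
    return [f"{BASE_URL}/c4-train.{i:05d}-of-{total:05d}.json.gz"
            for i in range(first, first + count)]

for url in shard_urls(0, 2):
    print(url)
```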
2. **Execute** the `generate_dpr_embeddings` function to generate a `.fvecs` file containing the new embeddings.
   Note that different settings should be used to generate the **base vectors** and the **query set**, as they use the
   DPR context and query encoders, respectively.
   See the script [dpr_dataset_10M.py](dpr_dataset_10M.py) for details.

```python
# Example code to generate base vectors

from dpr_dataset_generate import generate_dpr_embeddings

base_C4_folder = '/home/username/research/datasets/c4/en'  # Set this path to where your c4/en folder is located
cache_folder = '/home/username/.cache/huggingface/datasets/'  # Set to the Hugging Face datasets cache path
dataset_dir = f'{base_C4_folder}/train/'

num_embd = 10000000
init_file = 0
num_of_files = 2  # Make sure the input files (2 in this case) are enough to generate the
                  # requested number of embeddings.
                  # To get an estimate, use the optional parameter get_total_embeddings_only
                  # to get the number of embeddings that can be generated from a certain
                  # group of files without actually generating the embeddings.
fname_prefix_out = 'c4-en'
doc_stride = 32
max_length = 64
batch_size = 512
dim = 768

generate_dpr_embeddings(init_file, num_of_files, num_embd, doc_stride, max_length, dim,
                        batch_size,
                        dataset_dir, fname_prefix_out, cache_folder)
```

3. **Generate the ground truth** by conducting an exhaustive search with the inner product metric.
   We provide the [ground truth](gtruth_dpr10M_innerProduct.ivecs) for the dataset generated using
   [dpr_dataset_10M.py](dpr_dataset_10M.py).

> **_NOTE:_** Due to floating-point arithmetic precision, the vector embeddings generated with the provided
> code on different machines may vary slightly. Keep in mind that this could cause small discrepancies with the provided ground truth.

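The exhaustive search in step 3 amounts to a brute-force top-k scan of all base vectors for each query. A minimal NumPy sketch of the idea (the function name and the small random arrays are illustrative, not the repository's API):

```python
import numpy as np

def exhaustive_ground_truth(base, queries, k):
    """Exact top-k neighbor ids under the inner product metric, by brute force."""
    scores = queries @ base.T               # (n_queries, n_base) inner products
    order = np.argsort(-scores, axis=1)     # sort each row by decreasing score
    return order[:, :k].astype(np.int32)    # ids of the k best base vectors

rng = np.random.default_rng(0)
base = rng.standard_normal((1000, 768)).astype(np.float32)
queries = rng.standard_normal((10, 768)).astype(np.float32)
gt = exhaustive_ground_truth(base, queries, k=100)
```

At the full 10M x 10,000 scale the score matrix does not fit in memory, so the same computation would be done over blocks of queries.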
4. Functions `read_fvecs` and `read_ivecs` can be used to read `.fvecs` and `.ivecs` files, respectively.

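The repository ships its own readers; for reference, here is a sketch assuming the conventional `.fvecs`/`.ivecs` layout (each record is an `int32` dimension `d` followed by `d` 4-byte values, float32 or int32):

```python
import numpy as np

def read_fvecs(fname):
    # Each record: int32 dimension d, then d float32 components.
    raw = np.fromfile(fname, dtype=np.int32)
    d = raw[0]
    return raw.reshape(-1, d + 1)[:, 1:].view(np.float32)

def read_ivecs(fname):
    # Same layout, but the payload is int32 (e.g., ground-truth neighbor ids).
    raw = np.fromfile(fname, dtype=np.int32)
    d = raw[0]
    return raw.reshape(-1, d + 1)[:, 1:].copy()
```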
## References

Reference to cite when you use datasets generated with this code in a research paper:

```
@article{aguerrebere2023similarity,
        title = {Similarity search in the blink of an eye with compressed indices},
        volume = {16},
        number = {11},
        pages = {3433--3446},
        journal = {Proceedings of the VLDB Endowment},
        author = {Cecilia Aguerrebere and Ishwar Bhati and Mark Hildebrand and Mariano Tepper and Ted Willke},
        year = {2023}
}
```

77 |
| - |
78 |
| -<a id="1">[1]</a> |
79 |
| -Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W..: Dense Passage |
80 |
| -Retrieval for Open-Domain Question Answering. In: Proceedings of the 2020 Conference on Empirical |
81 |
| -Methods in Natural Language Processing (EMNLP). 6769–6781. (2020) |
82 |
| - |
83 |
| -<a id="2">[2]</a> |
84 |
| -Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, |
85 |
| -P.J.: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. |
86 |
| -In: The Journal of Machine Learning Research 21,140:1–140:67.(2020) |
87 |
| - |
88 |
| -<a id="3">[3]</a> |
89 |
| -Aguerrebere, C.; Bhati I.; Hildebrand M.; Tepper M.; Willke T.:Similarity search in the blink of an eye with compressed |
90 |
| -indices. In: Proceedings of the VLDB Endowment, 16, 11, 3433 - 3446. (2023) |
91 |
| - |
This "research quality code" is provided for non-commercial purposes by Intel "as is", without any express or implied
warranty of any kind. Please see the dataset's applicable license for terms and conditions. Intel does not own the
rights to this dataset and does not confer any rights to it. Intel does not warrant or assume responsibility for the accuracy or completeness of any information, text, graphics, links or other items within the code. A thorough security review has not been performed on this code. Additionally, this repository may contain components that are out of date or contain known security vulnerabilities.