This repository was archived by the owner on Feb 12, 2025. It is now read-only.

Ib/new datasets #2

Merged
merged 7 commits on Jan 5, 2024
95 changes: 3 additions & 92 deletions README.md
@@ -1,94 +1,5 @@
# DPR Dataset Generator
# Vector Search Datasets

This repository provides code to generate base and query vector datasets for similarity search benchmarking and evaluation on high-dimensional vectors stemming from large language models.
With the dense passage retriever (DPR) [[1]](#1), we encode text snippets from the C4 dataset [[2]](#2) to generate 768-dimensional vectors:
- context DPR embeddings for the base set and
- question DPR embeddings for the query set.
This repository provides code to generate several datasets for similarity search benchmarking and evaluation on high-dimensional vectors stemming from recent deep learning models.
Please see the details of each dataset in the respective README files.

The metric for similarity search is inner product [[1]](#1).

The number of base and query embedding vectors is configurable.

## DPR10M

A specific instance with 10 million base vectors and 10,000 query vectors is introduced in [[3]](#3). Use the script [dpr_dataset_10M.py](dpr_dataset_10M.py) to generate this dataset. The corresponding ground-truth (available [here](gtruth_dpr10M_innerProduct.ivecs)) is generated by conducting an exhaustive search with the inner product
metric.

Here is a summary of the **steps to generate this dataset**:

1. **Download the files** corresponding to the `en` variant of the C4 dataset available [here](https://huggingface.co/datasets/allenai/c4).
The complete set of files requires 350GB of storage, so you might want to follow the instructions to download only a subset. For example, to generate 10M embeddings,
we used the first two files from the train set (i.e., files `c4-train.00000-of-01024.json.gz` and `c4-train.00001-of-01024.json.gz` in `c4/en/train`).

2. **Execute** the `generate_dpr_embeddings` function to generate a `.fvecs` file containing the new embeddings.
Note that different settings should be used to generate the **base vectors** and the **query set**, as they use the
DPR context and question encoders, respectively.
See the script [dpr_dataset_10M.py](dpr_dataset_10M.py) for details.

```
# Example code to generate base vectors

from dpr_dataset_generate import generate_dpr_embeddings

base_C4_folder = '/home/username/research/datasets/c4/en'  # Set this to where your c4/en folder is located
cache_folder = '/home/username/.cache/huggingface/datasets/'  # Set to the Hugging Face datasets cache path
dataset_dir = f'{base_C4_folder}/train/'

num_embd = 10000000
init_file = 0
num_of_files = 2  # Make sure the input files (2 in this case) are enough to generate the
                  # requested number of embeddings. To get an estimate, use the optional
                  # parameter get_total_embeddings_only to get the number of embeddings
                  # that can be generated from a certain group of files without actually
                  # generating the embeddings.
fname_prefix_out = 'c4-en'
doc_stride = 32
max_length = 64
batch_size = 512
dim = 768

generate_dpr_embeddings(init_file, num_of_files, num_embd, doc_stride, max_length, dim,
                        batch_size,
                        dataset_dir, fname_prefix_out, cache_folder)
```
3. **Generate the ground-truth** by conducting an exhaustive search with the inner product metric.
We provide the [ground-truth](gtruth_dpr10M_innerProduct.ivecs) for the dataset generated using
[dpr_dataset_10M.py](dpr_dataset_10M.py).

> **_NOTE:_** Due to floating-point arithmetic precision, the vector embeddings generated by running the provided
> code on different machines may vary slightly. Keep in mind that this could cause small discrepancies with the provided ground-truth.

4. Functions `read_fvecs` and `read_ivecs` can be used to read `.fvecs` and `.ivecs` files, respectively.

## References
Please cite the following reference when using datasets generated with this code in a research paper:

```
@article{aguerrebere2023similarity,
  title = {Similarity search in the blink of an eye with compressed indices},
  volume = {16},
  number = {11},
  pages = {3433--3446},
  journal = {Proceedings of the VLDB Endowment},
  author = {Cecilia Aguerrebere and Ishwar Bhati and Mark Hildebrand and Mariano Tepper and Ted Willke},
  year = {2023}
}
```

<a id="1">[1]</a>
Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W..: Dense Passage
Retrieval for Open-Domain Question Answering. In: Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP). 6769–6781. (2020)

<a id="2">[2]</a>
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu,
P.J.: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.
In: The Journal of Machine Learning Research 21,140:1–140:67.(2020)

<a id="3">[3]</a>
Aguerrebere, C.; Bhati I.; Hildebrand M.; Tepper M.; Willke T.:Similarity search in the blink of an eye with compressed
indices. In: Proceedings of the VLDB Endowment, 16, 11, 3433 - 3446. (2023)

This "research quality code" is for Non-Commercial purposes provided by Intel "As Is" without any express or implied
warranty of any kind. Please see the dataset's applicable license for terms and conditions. Intel does not own the
rights to this data set and does not confer any rights to it. Intel does not warrant or assume responsibility for the accuracy or completeness of any information, text, graphics, links or other items within the code. A thorough security review has not been performed on this code. Additionally, this repository may contain components that are out of date or contain known security vulnerabilities.
94 changes: 94 additions & 0 deletions dpr/README.md
@@ -0,0 +1,94 @@
# DPR Dataset Generator

This repository provides code to generate base and query vector datasets for similarity search benchmarking and evaluation on high-dimensional vectors stemming from large language models.
With the dense passage retriever (DPR) [[1]](#1), we encode text snippets from the C4 dataset [[2]](#2) to generate 768-dimensional vectors:
- context DPR embeddings for the base set and
- question DPR embeddings for the query set.

The metric for similarity search is inner product [[1]](#1).

The number of base and query embedding vectors is configurable.

## DPR10M

A specific instance with 10 million base vectors and 10,000 query vectors is introduced in [[3]](#3). Use the script [dpr_dataset_10M.py](dpr_dataset_10M.py) to generate this dataset. The corresponding ground-truth (available [here](gtruth_dpr10M_innerProduct.ivecs)) is generated by conducting an exhaustive search with the inner product
metric.

Here is a summary of the **steps to generate this dataset**:

1. **Download the files** corresponding to the `en` variant of the C4 dataset available [here](https://huggingface.co/datasets/allenai/c4).
The complete set of files requires 350GB of storage, so you might want to follow the instructions to download only a subset. For example, to generate 10M embeddings,
we used the first two files from the train set (i.e., files `c4-train.00000-of-01024.json.gz` and `c4-train.00001-of-01024.json.gz` in `c4/en/train`); one way to fetch just those files is sketched below.
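As one way to fetch just those two files, a sketch using `huggingface_hub` (this helper is our assumption; the dataset card's own download instructions are authoritative):

```
from huggingface_hub import hf_hub_download

# Download only the first two train shards of the en variant instead of all 350GB.
for i in range(2):
    hf_hub_download(repo_id='allenai/c4', repo_type='dataset',
                    filename=f'en/c4-train.{i:05d}-of-01024.json.gz',
                    local_dir='/home/username/research/datasets/c4')
# Afterwards, arrange the shards under c4/en/train/ to match the layout used below.
```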

2. **Execute** the `generate_dpr_embeddings` function to generate a `.fvecs` file containing the new embeddings.
Note that different settings should be used to generate the **base vectors** and the **query set**, as they use the
DPR context and question encoders, respectively.
See the script [dpr_dataset_10M.py](dpr_dataset_10M.py) for details.

```
# Example code to generate base vectors

from dpr_dataset_generate import generate_dpr_embeddings

base_C4_folder = '/home/username/research/datasets/c4/en'  # Set this to where your c4/en folder is located
cache_folder = '/home/username/.cache/huggingface/datasets/'  # Set to the Hugging Face datasets cache path
dataset_dir = f'{base_C4_folder}/train/'

num_embd = 10000000
init_file = 0
num_of_files = 2  # Make sure the input files (2 in this case) are enough to generate the
                  # requested number of embeddings. To get an estimate, use the optional
                  # parameter get_total_embeddings_only to get the number of embeddings
                  # that can be generated from a certain group of files without actually
                  # generating the embeddings.
fname_prefix_out = 'c4-en'
doc_stride = 32
max_length = 64
batch_size = 512
dim = 768

generate_dpr_embeddings(init_file, num_of_files, num_embd, doc_stride, max_length, dim,
                        batch_size,
                        dataset_dir, fname_prefix_out, cache_folder)
```
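The comment above mentions an optional `get_total_embeddings_only` parameter for estimating yields before committing to a full run. A hedged sketch of that dry run (that it returns the count rather than writing output is our assumption; check the function's docstring for the authoritative behavior):

```
# Dry run: estimate how many embeddings the selected files can yield,
# assuming the flag makes generate_dpr_embeddings return the count
# instead of producing a .fvecs file.
total = generate_dpr_embeddings(init_file, num_of_files, num_embd, doc_stride,
                                max_length, dim, batch_size,
                                dataset_dir, fname_prefix_out, cache_folder,
                                get_total_embeddings_only=True)
print(f'{total} embeddings available from {num_of_files} file(s)')
```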
3. **Generate the ground-truth** by conducting an exhaustive search with the inner product metric.
We provide the [ground-truth](gtruth_dpr10M_innerProduct.ivecs) for the dataset generated using
[dpr_dataset_10M.py](dpr_dataset_10M.py).

> **_NOTE:_** Due to floating-point arithmetic precision, the vector embeddings generated by running the provided
> code on different machines may vary slightly. Keep in mind that this could cause small discrepancies with the provided ground-truth.
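For reference, a minimal sketch of the exhaustive search in step 3, assuming the base and query vectors are already loaded as NumPy arrays (the function and variable names are illustrative, not from this repo):

```
import numpy as np

def exhaustive_ground_truth(base, queries, k=100, batch=64):
    # Brute-force top-k under inner product (higher score = closer match).
    # Queries are processed in batches to bound the size of the score matrix.
    gt = np.empty((len(queries), k), dtype=np.int32)
    for i in range(0, len(queries), batch):
        scores = queries[i:i + batch] @ base.T           # (batch, num_base)
        topk = np.argpartition(-scores, k, axis=1)[:, :k]
        # argpartition leaves the top-k block unsorted; order it by score.
        order = np.argsort(-np.take_along_axis(scores, topk, axis=1), axis=1)
        gt[i:i + len(topk)] = np.take_along_axis(topk, order, axis=1)
    return gt
```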

4. Functions `read_fvecs` and `read_ivecs` can be used to read `.fvecs` and `.ivecs` files, respectively.
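The repository provides these readers; for reference, a minimal equivalent assuming the standard TEXMEX layout, where every vector is stored as an int32 component count followed by that many float32 (`.fvecs`) or int32 (`.ivecs`) values:

```
import numpy as np

def read_fvecs(fname):
    raw = np.fromfile(fname, dtype=np.int32)
    d = raw[0]  # per-vector dimensionality, repeated before every vector
    return raw.reshape(-1, d + 1)[:, 1:].copy().view(np.float32)

def read_ivecs(fname):
    raw = np.fromfile(fname, dtype=np.int32)
    d = raw[0]
    return raw.reshape(-1, d + 1)[:, 1:].copy()
```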

## References
Please cite the following reference when using datasets generated with this code in a research paper:

```
@article{aguerrebere2023similarity,
  title = {Similarity search in the blink of an eye with compressed indices},
  volume = {16},
  number = {11},
  pages = {3433--3446},
  journal = {Proceedings of the VLDB Endowment},
  author = {Cecilia Aguerrebere and Ishwar Bhati and Mark Hildebrand and Mariano Tepper and Ted Willke},
  year = {2023}
}
```

<a id="1">[1]</a>
Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W..: Dense Passage
Retrieval for Open-Domain Question Answering. In: Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP). 6769–6781. (2020)

<a id="2">[2]</a>
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu,
P.J.: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.
In: The Journal of Machine Learning Research 21,140:1–140:67.(2020)

<a id="3">[3]</a>
Aguerrebere, C.; Bhati I.; Hildebrand M.; Tepper M.; Willke T.:Similarity search in the blink of an eye with compressed
indices. In: Proceedings of the VLDB Endowment, 16, 11, 3433 - 3446. (2023)

This "research quality code" is for Non-Commercial purposes provided by Intel "As Is" without any express or implied
warranty of any kind. Please see the dataset's applicable license for terms and conditions. Intel does not own the
rights to this data set and does not confer any rights to it. Intel does not warrant or assume responsibility for the accuracy or completeness of any information, text, graphics, links or other items within the code. A thorough security review has not been performed on this code. Additionally, this repository may contain components that are out of date or contain known security vulnerabilities.
File renamed without changes.
File renamed without changes.
File renamed without changes.
89 changes: 89 additions & 0 deletions wit/README.md
@@ -0,0 +1,89 @@
# wit-512-1M Dataset Generator

This repository provides code to generate base and query (test and learn sets) embeddings for similarity search benchmarking
and evaluation on high-dimensional vectors. The dataset is designed to benchmark similarity search methods under
scenarios with out-of-distribution (OOD) queries stemming from a text-to-image application [[1]](#1).

The WIT dataset [[2]](#2) is a multimodal, multilingual dataset that contains 37 million rich image-text examples
extracted from Wikipedia pages. For each example in the first million training images
(downloaded from [here](https://storage.cloud.google.com/wikimedia-image-caption-public/image_data_train.tar)), we take the image and encode it
using the multimodal OpenAI CLIP-ViT-B32 model [[3]](#3) to generate a database vector.
We create the query set using the first 20K text descriptions in one of the provided test sets (concatenating
the Reference and Attribution description fields) and generate the corresponding embeddings using CLIP-ViT-B32-multilingual-v1 [[4]](#4).
The use of CLIP-ViT-B32 for images and CLIP-ViT-B32-multilingual-v1 for text follows the protocol suggested
[here](https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1).
Finally, for each query, we compute the 100 ground-truth nearest neighbors using maximum inner product.
We use the first 10K queries as the query test set and the remaining 10K as a learn set.
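Following that protocol, a minimal sketch of how the two encoders pair up in `sentence-transformers` (the image file and caption below are placeholders):

```
from PIL import Image
from sentence_transformers import SentenceTransformer

# Base vectors: images through CLIP-ViT-B32.
img_model = SentenceTransformer('clip-ViT-B-32')
# Query vectors: text through the multilingual text encoder aligned to it.
txt_model = SentenceTransformer('sentence-transformers/clip-ViT-B-32-multilingual-v1')

img_emb = img_model.encode([Image.open('example.jpg')])  # shape (1, 512)
txt_emb = txt_model.encode(['A photo of a cat'])         # shape (1, 512)
score = img_emb @ txt_emb.T  # inner-product similarity
```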

The metric for similarity search used with this dataset is inner product.


## Steps to generate the dataset

The script [wit_dataset_1M.py](wit_dataset_1M.py) generates 1 million base vectors from the provided training images
and two query sets, a test set and a learn set, each with 10K vectors generated from the text descriptions of the provided test set.

Here is a summary of the **steps to generate this dataset**:

1. **Download the WIT training images and test set**.
We download the training images from [here](https://storage.cloud.google.com/wikimedia-image-caption-public/image_data_train.tar) and extract them to the desired location:
> **_NOTE:_** The above link requires Google login authentication to download the training images.

```
tar -xvf image_data_train.tar -C $BASE_PATH
```
The extracted files will be in `$BASE_PATH/image_data_train/image_pixels/`.
The images are encoded in base64 format; see the image/file format details [here](https://www.kaggle.com/c/wikipedia-image-caption/data) and the decoding sketch below.
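As an illustration of that format, a decoding sketch assuming each line of a part file is tab-separated with the base64 image bytes in the second field (the part-file name and field index are assumptions; the Kaggle page above is authoritative):

```
import base64, csv, gzip, io
from PIL import Image

csv.field_size_limit(1 << 24)  # base64-encoded images overflow the default field limit

# Hypothetical part-file name; list image_data_train/image_pixels/ for the real ones.
with gzip.open('image_data_train/image_pixels/part-00000.csv.gz', 'rt') as f:
    for row in csv.reader(f, delimiter='\t'):
        image = Image.open(io.BytesIO(base64.b64decode(row[1])))
        print(image.size)
        break  # decode only the first image as a smoke test
```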

For queries, we download one of the provided [test sets](https://storage.googleapis.com/gresearch/wit/wit_v1.test.all-00000-of-00005.tsv.gz) and extract it:
```
mkdir -p $BASE_PATH/test_set
# The file is plain gzip, not a tar archive
gunzip -c wit_v1.test.all-00000-of-00005.tsv.gz > $BASE_PATH/test_set/wit_v1.test.all-00000-of-00005.tsv
```

2. **Run** the `wit_dataset_1M.py` script to generate `.fvecs` files containing the base,
query, and learn set vectors. **Remember to set the paths** to the folders where the
downloaded training images and test files are located.

3. **Generate the ground-truth** by conducting an exhaustive search with the inner product metric.
We provide the ground-truth files `wit_test_gt_1M_innerproduct.ivecs` and `wit_learn_gt_1M_innerproduct.ivecs` for the test and learn sets, respectively.

4. Functions `read_fvecs` and `read_ivecs` can be used to read `.fvecs` and `.ivecs` files, respectively.

> **_NOTE:_** Due to floating-point arithmetic precision, the vector embeddings generated by running the provided
> code on different machines may vary slightly. Keep in mind that this could cause small discrepancies with the provided ground-truth.
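Since the ground-truth files exist to score search methods, here is an illustrative recall helper one might use against them (not part of this repo):

```
import numpy as np

def recall_at_k(retrieved, gt, k=10):
    # Fraction of the true top-k neighbors (from the .ivecs ground truth)
    # recovered among the candidates returned by the method under test.
    hits = sum(np.intersect1d(retrieved[i, :k], gt[i, :k]).size
               for i in range(len(gt)))
    return hits / (len(gt) * k)
```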


## References
Please cite the following reference when using datasets generated with this code in a research paper:

```
@article{tepper2023leanvec,
  title = {LeanVec: Search your vectors faster by making them fit},
  author = {Mariano Tepper and Ishwar Singh Bhati and Cecilia Aguerrebere and Mark Hildebrand and Ted Willke},
  year = {2023},
  journal = {arXiv},
  doi = {https://doi.org/10.48550/arXiv.2312.16335}
}
```
<a id="1">[1]</a>
Mariano Tepper, Ishwar Singh Bhati, Cecilia Aguerrebere, Mark Hildebrand, and Ted Willke:
LeanVec: Search your vectors faster by making them fit. (2023)

<a id="2">[2]</a>
Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork:
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning. (2021)

<a id="3">[3]</a>
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever:
Learning Transferable Visual Models From Natural Language Supervision. (2021)

<a id="4">[4]</a>
Nils Reimers, and Iryna Gurevych:
Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation. (2020)

This "research quality code" is for Non-Commercial purposes provided by Intel "As Is" without any express or implied
warranty of any kind. Please see the dataset's applicable license for terms and conditions. Intel does not own the
rights to this data set and does not confer any rights to it. Intel does not warrant or assume responsibility for the accuracy or completeness of any information, text, graphics, links or other items within the code. A thorough security review has not been performed on this code. Additionally, this repository may contain components that are out of date or contain known security vulnerabilities.

5 changes: 5 additions & 0 deletions wit/requirements_wit.txt
@@ -0,0 +1,5 @@
sentence-transformers==2.2.2
Pillow==9.4.0
numpy==1.26.2
typing_extensions==4.5.0
natsort==8.4.0
50 changes: 50 additions & 0 deletions wit/wit_dataset_1M.py
@@ -0,0 +1,50 @@
from wit_dataset_generate import generate_image_embeddings, generate_text_embeddings


if __name__ == "__main__":
    # This is an example script to generate the wit-512-1M dataset, containing:
    #
    # --- 1M base vectors from image embeddings
    # --- out-of-distribution (OOD) query and learn sets (10k vectors each) from text embeddings
    #
    # Introduced in the paper "LeanVec: Search your vectors faster by making them fit", 2023,
    # Tepper, Bhati, Aguerrebere, Hildebrand, Willke (https://arxiv.org/abs/2312.16335)
    #
    # This dataset is created using a subset of Google's multimodal multilingual WIT dataset,
    # using image-text examples extracted from Wikipedia pages (https://github.com/google-research-datasets/wit).
    # To generate a base vector, we take the image and encode it using the OpenAI CLIP-ViT-B32 model.
    # For queries, we use text descriptions in one of the provided test sets
    # (concatenating the Reference and Attribution description fields) and generate the corresponding
    # embeddings using CLIP-ViT-B32-multilingual-v1. We followed the steps suggested in
    # https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1
    #
    # See the README for more details.
    #
    # Please see the documentation of the generate_image_embeddings and generate_text_embeddings
    # functions for details on the required parameters.

    base_path = '/raid0/ishwarsi/datasets/wit'  # Path to where the WIT images and test set are located
    images_dir = f'{base_path}/image_data_train/image_pixels'  # Directory with the image *.csv.gz files
    output_dir = f'{base_path}/output'  # Directory where the created dataset will be saved

    # Files are saved as [output_dir]/embeddings/{fname_prefix}.fvecs

    # Generate image embeddings from the images provided in images_dir
    fname_prefix = 'wit_base_1M'
    num_vecs = 1_000_000
    generate_image_embeddings(images_dir, num_vecs, output_dir, fname_prefix)

    test_file = f'{base_path}/test_set/wit_v1.test.all-00000-of-00005.tsv'

    # Generate text embeddings (query test set) from the test file
    fname_prefix = 'wit_query_10k'
    num_vecs = 10_000
    generate_text_embeddings(test_file, num_vecs, output_dir, fname_prefix)

    # Learn queries start from an offset (the last parameter) so they do not overlap the test queries
    fname_prefix = 'wit_learn_query_10k'
    num_vecs = 10_000
    num_skip = 10_000
    generate_text_embeddings(test_file, num_vecs, output_dir, fname_prefix, num_skip)