Inverted-Index-Search-Engine

A Document Search Engine project with TF-IDF.

Prerequisites

Python 3.5+
pip3
NLTK
Scikit-learn

1. Data Collection

Here, we use a custom dataset scraped from No Starch Press. The dataset contains the books that the publisher lists under the Programming tag.

1.1 Data Cleaning:

In this step we clean the scraped data by replacing punctuation, symbols, and digits with spaces.

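# Characters to replace with a space: punctuation, symbols, and digits.
special_chars = '''!()-[]{};:'"\\, <>./?@#$%^&*_~0123456789+='''

# pub_name holds the scraped entries (one string per entry).
pub_list_special_rm = []

for file in pub_name:
    if len(file.split()) == 1:
        # Single-word entries are kept unchanged.
        pub_list_special_rm.append(file)
    else:
        word_sc_rm = ""
        for a in file:
            word_sc_rm += ' ' if a in special_chars else a
        pub_list_special_rm.append(word_sc_rm)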

1.2 Data Pre-processing

In this step, the cleaned data is pre-processed before creating the inverted index of tokens. The pre-processing pipeline includes tokenizing each sentence, removing stop words and finally stemming.

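from nltk.corpus import stopwords
from nltk.stem import PorterStemmer  # assumption: the README does not name the stemmer
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words('english'))  # assumption: English stop-word list
stemmer = PorterStemmer()
pub_list_stemmed = []

for name in pub_list_special_rm:
    # Tokenize, drop stop words, stem, and rejoin into one lowercase string.
    stems = [stemmer.stem(a) for a in word_tokenize(name) if a.lower() not in STOPWORDS]
    pub_list_stemmed.append(' '.join(stems).lower())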

2. Indexing

An inverted index is created: each token that occurs in the stemmed entries is a key, and its value is the list of indexes of the entries that contain it.

data_dict = {}  
  
for a in range(len(pub_list_stemmed)):  
    for b in pub_list_stemmed[a].split():  
        if b not in data_dict:  
            data_dict[b] = [a]  
        else:  
            data_dict[b].append(a)

[Figure: Inverted Index]
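
A lookup against such an index can be sketched as follows. This is a minimal illustration on a toy list of already-stemmed titles; `build_index` and `lookup` are hypothetical helper names, not functions from this repository, and multi-word queries are assumed to require every token (AND semantics).

```python
def build_index(docs):
    """Map each token to the list of document indexes containing it."""
    index = {}
    for doc_id, doc in enumerate(docs):
        for token in set(doc.split()):  # set() avoids duplicate ids per document
            index.setdefault(token, []).append(doc_id)
    return index


def lookup(index, query):
    """Return ids of documents that contain every token of the query."""
    tokens = query.split()
    if not tokens:
        return []
    result = set(index.get(tokens[0], []))
    for token in tokens[1:]:
        result &= set(index.get(token, []))
    return sorted(result)


# Toy data standing in for pub_list_stemmed (hypothetical titles).
docs = ["python crash cours", "autom bore stuff python", "rust program languag"]
index = build_index(docs)
print(lookup(index, "python"))        # [0, 1]
print(lookup(index, "python autom"))  # [1]
```

Intersecting the posting lists keeps the lookup cheap, since only documents listed under the query tokens are ever touched.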

3. Search Engine

This search engine uses the TF-IDF algorithm. TF-IDF stands for "Term Frequency - Inverse Document Frequency". It is a technique that assigns each word a weight signifying how important the word is to a document relative to the whole corpus.
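
To make the weighting concrete, here is a minimal sketch that computes TF-IDF by hand on a toy corpus. It assumes the textbook form tf(t, d) * log(N / df(t)); scikit-learn's TfidfVectorizer uses a smoothed and normalized variant, so its numbers will differ. The function name `tf_idf` and the sample documents are illustrative only.

```python
import math


def tf_idf(term, doc_tokens, corpus_tokens):
    """Textbook TF-IDF: term frequency in the document times log(N / df)."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for tokens in corpus_tokens if term in tokens)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus_tokens) / df)


corpus = [
    "python crash course".split(),
    "python for kids".split(),
    "the rust programming language".split(),
]

# "python" appears in 2 of 3 documents and "crash" in only 1,
# so "crash" gets the higher weight within the first document.
print(tf_idf("python", corpus[0], corpus))
print(tf_idf("crash", corpus[0], corpus))
```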

3.1 Calculating ranking using Cosine Similarity.

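Cosine similarity is one of the most common metrics for measuring how similar two documents are. It is the cosine of the angle between their TF-IDF vectors, so documents that use the same terms in similar proportions score close to 1.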
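
As a rough illustration of the metric itself, the sketch below computes cosine similarity between two vectors in plain Python: the dot product divided by the product of the vector norms. The function name `cosine` and the sample vectors are made up for this example; the project itself relies on scikit-learn's implementation.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


query = [1.0, 0.0, 1.0]
doc_a = [2.0, 0.0, 2.0]  # same direction as the query, so the score is close to 1
doc_b = [0.0, 3.0, 0.0]  # shares no terms with the query, so the score is 0

print(cosine(query, doc_a))
print(cosine(query, doc_b))
```

Because the score depends only on direction, a long document is not favored over a short one merely for repeating the same terms more often.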

Generating TF-IDF using TfidfVectorizer

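tfidf = TfidfVectorizer()  # from sklearn.feature_extraction.text
temp_file = tfidf.fit_transform(temp_file)  # fit on the documents and build their TF-IDF matrix
# cosine_similarity comes from sklearn.metrics.pairwise; stem_word_file is transformed with the same vocabulary
cosine_output = cosine_similarity(temp_file, tfidf.transform(stem_word_file))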
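
The `search_data` function used in the next step is not shown in this README. A minimal sketch of how such a function might combine the inverted index with the two calls above is given here. It assumes the index is used to pick candidate documents, which are then ranked by cosine similarity against the query. The parameter names, the toy data, and the omission of query stemming are simplifications for illustration, not the repository's actual code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def search_data(query, docs, index):
    """Rank documents that share a token with the query by TF-IDF cosine similarity."""
    # Candidate documents: every document listed under any query token.
    candidates = sorted({i for tok in query.lower().split() for i in index.get(tok, [])})
    if not candidates:
        return []
    tfidf = TfidfVectorizer()
    doc_matrix = tfidf.fit_transform([docs[i] for i in candidates])
    scores = cosine_similarity(doc_matrix, tfidf.transform([query.lower()])).ravel()
    # Highest score first; ties keep the original document order.
    ranked = sorted(zip(candidates, scores), key=lambda pair: -pair[1])
    return [(docs[i], float(score)) for i, score in ranked]


# Toy stand-ins for pub_list_stemmed and data_dict (hypothetical data).
docs = ["python crash cours", "autom bore stuff python", "rust program languag"]
index = {}
for i, doc in enumerate(docs):
    for tok in doc.split():
        index.setdefault(tok, []).append(i)

for title, score in search_data("python", docs, index):
    print(f"{score:.3f}  {title}")
```

Restricting the TF-IDF step to candidates from the index keeps the matrix small; the shorter title ranks first here because the query term makes up a larger share of its vector.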

Testing the function

search_data('python')

Result of similar documents for word "Python".

Conclusion

At its current stage, the search engine has very limited capability. A vector encoder model would provide semantic search, returning results that are similar in meaning, whereas the TF-IDF model only matches terms and does not capture what the words mean.
