
JeffrinE/Inverted-Index-Search-Engine


A Document Search Engine project with TF-IDF.

Prerequisites


  • Python 3.5+
  • pip3
  • NLTK
  • Scikit-learn
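The prerequisites can be installed with pip3. A setup sketch (NLTK additionally needs its tokenizer and stop-word data downloaded; exact package versions may vary):

```shell
pip3 install nltk scikit-learn
python3 -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
```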

1. Data Collection


Here, we use a custom dataset scraped from No Starch Press. The dataset is a collection of books published by No Starch Press under the tag Programming.

1.1 Data Cleaning


In this step, we clean the scraped data by removing any unnecessary characters.

# Characters (and digits) to strip from the scraped titles.
special_chars = '''!()-[]{};:'"\\,<>./?@#$%^&*_~0123456789+='''

pub_list_special_rm = []
for file in pub_name:
    word_sc_rm = ""
    if len(file.split()) == 1:
        # Single-word titles are kept unchanged.
        pub_list_special_rm.append(file)
    else:
        for a in file:
            # Replace each special character with a space.
            if a in special_chars:
                word_sc_rm += ' '
            else:
                word_sc_rm += a
        pub_list_special_rm.append(word_sc_rm)

1.2 Data Pre-processing


In this step, the cleaned data is pre-processed before the inverted index of tokens is created. The pre-processing pipeline tokenizes each sentence, removes stop words, and finally stems the remaining tokens.

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()  # stemmer choice assumed; the text does not name one
pub_list_stemmed = []

for name in pub_list_special_rm:
    words = word_tokenize(name)
    stem_word = ""
    for a in words:
        if a.lower() not in STOPWORDS:
            stem_word += stemmer.stem(a) + ' '
    pub_list_stemmed.append(stem_word.lower())

2. Indexing


An inverted index is created, mapping each token to the list of indexes of the documents that contain it.

data_dict = {}

for a in range(len(pub_list_stemmed)):
    for b in pub_list_stemmed[a].split():
        # Record each document index at most once per token.
        if b not in data_dict:
            data_dict[b] = [a]
        elif a not in data_dict[b]:
            data_dict[b].append(a)

Inverted Index
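Once built, the index answers "which documents contain this token?" with a single dictionary lookup. A minimal end-to-end sketch (the stemmed titles below are made-up stand-ins for the scraped data):

```python
# Toy corpus of stemmed titles standing in for pub_list_stemmed.
pub_list_stemmed = ["python crash cours", "python playground", "rust crash cours"]

data_dict = {}
for doc_id, title in enumerate(pub_list_stemmed):
    for token in title.split():
        # Record each document index at most once per token.
        if token not in data_dict:
            data_dict[token] = [doc_id]
        elif doc_id not in data_dict[token]:
            data_dict[token].append(doc_id)

print(data_dict["python"])  # documents 0 and 1 contain "python"
print(data_dict["crash"])   # documents 0 and 2 contain "crash"
```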


3. Search Engine


This search engine uses the TF-IDF algorithm. TF-IDF stands for "Term Frequency - Inverse Document Frequency", a technique that assigns each word a weight signifying its importance within a document and across the corpus.
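The weighting can be seen directly with scikit-learn's TfidfVectorizer: terms appearing in fewer documents receive a higher inverse-document-frequency weight. A small sketch on an illustrative corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# "python" appears in two of three documents, "rust" in only one,
# so "rust" gets the higher IDF (it is more discriminative).
corpus = ["python data science", "python web", "rust systems"]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(corpus)  # shape: (3 documents, vocabulary size)

vocab = tfidf.vocabulary_  # token -> column index
idf = tfidf.idf_           # per-token IDF weights
print(idf[vocab["python"]] < idf[vocab["rust"]])  # True: common terms weigh less
```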

3.1 Calculating Rankings Using Cosine Similarity


Cosine similarity is the most common metric for measuring the similarity between document texts: it computes the cosine of the angle between two TF-IDF vectors.

Generating TF-IDF using TfidfVectorizer


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf = TfidfVectorizer()
temp_file = tfidf.fit_transform(temp_file)
cosine_output = cosine_similarity(temp_file, tfidf.transform(stem_word_file))
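The repository text does not show search_data itself; a plausible sketch, assuming a hypothetical stem_word_file holding the stemmed titles (the titles below are illustrative stand-ins):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the stemmed corpus (stem_word_file).
stem_word_file = ["python crash cours", "python playground", "rust in action"]

tfidf = TfidfVectorizer()
doc_matrix = tfidf.fit_transform(stem_word_file)

def search_data(query):
    """Rank documents by cosine similarity to the query (a sketch)."""
    query_vec = tfidf.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    # Return (document index, score) pairs, best match first.
    return sorted(enumerate(scores), key=lambda p: p[1], reverse=True)

results = search_data("python")
```

Documents that share no terms with the query score exactly zero, so only titles containing "python" rank above the rust title here.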

Testing the function


search_data('python')

Result of similar documents for word "Python".

Conclusion


The search engine at its current stage has very limited capability. Because the TF-IDF model has no understanding of word meaning, replacing it with a vector-encoder model would provide semantic search results that match on meaning rather than on exact terms.
