Compare document similarity using Python | NLP
Hey Reverse PY Network! In this post we are going to build a web application which will compare the similarity between two documents. We will learn the very basics of natural language processing (NLP), a branch of artificial intelligence that deals with the interaction between computers and humans using natural language.
Let's start with the base structure of the program, and then we will add a graphical interface to make it much easier to use. Feel free to contribute to this project on my GitHub.
NLTK and Gensim
The Natural Language Toolkit (NLTK) is the most popular library for natural language processing (NLP). It is written in Python and has a big community behind it. NLTK is also very easy to learn; in fact, it is the easiest NLP library we are going to use. It contains text processing libraries for tokenization, parsing, classification, stemming, tagging and semantic reasoning.
Gensim is billed as a natural language processing package that does 'Topic Modeling for Humans'. But it is practically much more than that. It is a leading, state-of-the-art package for processing texts and working with word vector models (such as Word2Vec, FastText, etc.).
Topic models and word embeddings are available in other packages like scikit-learn and R, but the breadth and scope of the facilities to build and evaluate topic models are unparalleled in Gensim, and it offers many more convenient facilities for text processing. Another important benefit of Gensim is that it lets you handle big text files without loading the whole file into memory.
First, let's install nltk and gensim with the following commands:
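Using pip, for example (NLTK's tokenizers also need the punkt data, which can be downloaded once as shown):

```bash
pip install nltk gensim
python -m nltk.downloader punkt
```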
Tokenization of words (NLTK)
We use the method word_tokenize() to split a sentence into words. Take a look at the example below.
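A minimal sketch (the sample sentence here is just placeholder text of my own):

```python
from nltk.tokenize import word_tokenize

text = "Tokenization is the first step in text analytics."
print(word_tokenize(text))
# ['Tokenization', 'is', 'the', 'first', 'step', 'in', 'text', 'analytics', '.']
```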
Tokenization of sentences (NLTK)
An obvious question in your mind would be why sentence tokenization is needed when we have the option of word tokenization. Imagine we need to count the average number of words per sentence; to accomplish such a task, we use sentence tokenization as well as word tokenization to calculate the ratio.
Now you know how these methods are useful when handling text classification. Let's implement them in our similarity algorithm.
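For example, a rough sketch of that ratio (the sample text is my own):

```python
from nltk.tokenize import sent_tokenize, word_tokenize

text = "NLTK is easy to learn. Gensim can handle large text files. Together they cover a lot of NLP tasks."
sentences = sent_tokenize(text)
words = word_tokenize(text)

print(sentences)                    # list of 3 sentences
print(len(words) / len(sentences))  # average tokens per sentence
```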
Open file and tokenize sentences
Create a .txt file and write 4-5 sentences in it. Place the file in the same directory as your Python program. Now we are going to open this file with Python and split it into sentences.
The program will open the file and read its content. Then it will add the tokenized sentences into an array for word tokenization.
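A sketch of that step, assuming the main file is called demofile.txt (the name referred to later in this post):

```python
from nltk.tokenize import sent_tokenize

file_docs = []

# Read the whole file and split its content into sentences
with open('demofile.txt') as f:
    for sentence in sent_tokenize(f.read()):
        file_docs.append(sentence)

print("Number of documents:", len(file_docs))
```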
Tokenize words and create dictionary
Once we have added the tokenized sentences to the array, it is time to tokenize the words for each sentence.
In order to work on text documents, Gensim requires the words (aka tokens) to be converted to unique ids. So Gensim lets you create a Dictionary object that maps each word to a unique id. Let's convert our sentences to a [list of words] and pass it to the corpora.Dictionary() object.
A dictionary maps every word to a number. Gensim lets you read the text and update the dictionary, one line at a time, without loading the entire text file into system memory.
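Putting those two steps together (lowercasing the tokens is my own choice here, not a requirement):

```python
from gensim import corpora
from nltk.tokenize import word_tokenize

# Tokenize the words of each sentence collected above
gen_docs = [[word.lower() for word in word_tokenize(text)]
            for text in file_docs]

# Map every unique token to an integer id
dictionary = corpora.Dictionary(gen_docs)
print(dictionary.token2id)
```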
Create a bag of words
The next important object you need to become familiar with in order to work with Gensim is the corpus (a bag of words). It is basically an object that contains each word's id and its frequency in each document (it simply lists the number of times each word occurs in the sentence).
Note that, a ‘token’ typically means a ‘word’. A ‘document’ can typically refer to a ‘sentence’ or ‘paragraph’ and a ‘corpus’ is typically a ‘collection of documents as a bag of words’.
Now, create a bag-of-words corpus by passing the tokenized list of words to Dictionary.doc2bow().
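Roughly:

```python
# One bag of words per sentence: a list of (token_id, token_count) pairs
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
print(corpus)
```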
Let's assume that our documents are:
As you can see, we used "the" two times in the second sentence, and if you look at the word with id=12 ("the") you will see that its frequency is 2 (it appears 2 times in the sentence).
Term Frequency–Inverse Document Frequency (TF-IDF) is also a bag-of-words model, but unlike the regular corpus, TF-IDF down-weights tokens (words) that appear frequently across documents.
TF-IDF is calculated by multiplying a local component (TF) with a global component (IDF) and optionally normalizing the result to unit length. Term frequency is how often the word shows up in the document, and inverse document frequency scales the value by how rare the word is in the corpus. In simple terms, words that occur more frequently across the documents get smaller weights.
The word 'the' occurs in two documents, so it was weighted down. The words 'this' and 'is' appear in all three documents, so they were removed altogether.
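A sketch of building the TF-IDF model on top of the corpus from the previous step:

```python
from gensim import models

tf_idf = models.TfidfModel(corpus)

# Print each document as (word, tf-idf weight) pairs
for doc in tf_idf[corpus]:
    print([[dictionary[token_id], round(weight, 2)] for token_id, weight in doc])
```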
Creating similarity measure object
Now we are going to create a similarity object. The main class is Similarity, which builds an index for a given set of documents. The Similarity class splits the index into several smaller sub-indexes, which are disk-based. Let's create the similarity object, and then you will understand how we can use it for comparing.
We are storing the index matrix in the 'workdir' directory, but you can name it whatever you want; just make sure to create it in the same directory as your program.
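A minimal sketch; the 'workdir/' prefix is just where Gensim will write the index shards:

```python
from gensim import similarities

# Build a disk-backed similarity index over the tf-idf weighted corpus
sims = similarities.Similarity('workdir/', tf_idf[corpus],
                               num_features=len(dictionary))
```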
Create Query Document
Once the index is built, we are going to calculate how similar the query document is to each document in the index. So, create a second .txt file which will include the query documents or sentences, and tokenize them as we did before.
Since we get new documents (query documents or sentences), it is possible to update the existing dictionary to include the new words.
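A sketch, assuming the query file is called demofile2.txt (the name mentioned later). Here the whole query file is treated as one document; note that doc2bow() simply ignores words that are not already in the dictionary unless you pass allow_update=True.

```python
from nltk.tokenize import sent_tokenize, word_tokenize

file2_docs = []

with open('demofile2.txt') as f:
    for sentence in sent_tokenize(f.read()):
        file2_docs.append(sentence)

# Treat the whole query file as one document
query_doc = [word.lower() for word in word_tokenize(' '.join(file2_docs))]
query_doc_bow = dictionary.doc2bow(query_doc)
```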
Document similarities to query
At this stage, you will see the similarities between the query and all of the indexed documents. To obtain the similarities of our query document against the indexed documents:
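Continuing with the objects from the previous steps:

```python
# Convert the query to tf-idf space and look it up in the index
query_doc_tf_idf = tf_idf[query_doc_bow]

# One cosine similarity per indexed document
print(sims[query_doc_tf_idf])
```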
Cosine measure returns similarities in the range <-1, 1> (the greater, the more similar).
Assume that our documents are:
and query document is:
As a result, we can see that the third document is the most similar.
What's next? I think it is better to calculate the average similarity of the query document. This time we are going to import numpy to calculate the sum of these similarity outputs.
NumPy will help us calculate the sum of these floats, and the output is:
To calculate the average similarity, we have to divide this value by the number of indexed documents:
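Something like:

```python
import numpy as np

# Sum the similarity scores of the query against every indexed document
sum_of_sims = np.sum(sims[query_doc_tf_idf], dtype=np.float32)
print(sum_of_sims)
```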
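For instance:

```python
# Average similarity = sum of similarities / number of indexed documents
avg_similarity = float(sum_of_sims) / len(file_docs)
print(f'{round(avg_similarity * 100)}%')
```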
Now we can say that the query document (demofile2.txt) is 26% similar to the main documents (demofile.txt).
What if we have more than one query document?
As a solution, we can calculate the sum of the averages for each query document, which gives us an overall similarity percentage.
Assume that our main documents are:
By the way, I am using random word generator tools to create these documents. Anyway, our query documents are:
Let's see the code:
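Here is a consolidated sketch of the whole pipeline with multiple query sentences; the file names, the lowercasing, and the cap at 100% are my own choices, not fixed requirements of the approach:

```python
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize
from gensim import corpora, models, similarities

# --- index the main documents ---
file_docs = []
with open('demofile.txt') as f:
    for sentence in sent_tokenize(f.read()):
        file_docs.append(sentence)

gen_docs = [[word.lower() for word in word_tokenize(text)] for text in file_docs]
dictionary = corpora.Dictionary(gen_docs)
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
tf_idf = models.TfidfModel(corpus)
sims = similarities.Similarity('workdir/', tf_idf[corpus],
                               num_features=len(dictionary))

# --- compare every query document (sentence) against the index ---
file2_docs = []
with open('demofile2.txt') as f:
    for sentence in sent_tokenize(f.read()):
        file2_docs.append(sentence)

avg_sims = []
for line in file2_docs:
    query_doc = [word.lower() for word in word_tokenize(line)]
    query_doc_tf_idf = tf_idf[dictionary.doc2bow(query_doc)]
    # average similarity of this query sentence against all indexed sentences
    sum_of_sims = np.sum(sims[query_doc_tf_idf], dtype=np.float32)
    avg = float(sum_of_sims) / len(file_docs)
    print('avg:', avg)
    avg_sims.append(avg)

# overall similarity of the query file to the main file, as a percentage
total_avg = float(np.sum(avg_sims, dtype=np.float32))
percentage_of_similarity = round(total_avg * 100)
if percentage_of_similarity >= 100:
    percentage_of_similarity = 100  # the sum of averages can exceed 1, so cap it
print(f'Similarity: {percentage_of_similarity}%')
```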
We had 3 query documents, and the program computed the average similarity for each of them. If we sum these values, the result will be:
We format the value as a percentage by multiplying it by 100 and rounding it to keep the value simple. The final result with Django:
Great! I hope you learned some basics of NLP from this project. In addition, I implemented this algorithm in Django to create a graphical interface. Feel free to contribute to the project on my GitHub.
I hope you learned something from this lab 😃 and if you found it useful, please share it and join me on social media! As always Stay Connected!🚀
Support me by buying me a cup of coffee. Thank you!