Document Similarity Using Cosine Similarity
Here I consider 8 text documents: a set of news articles covering three different news topics, namely Hurricane Gilbert Heads Toward Dominican Coast, an IRA terrorist attack, and McDonald’s Opens First Restaurant in China.
My goal is to determine document similarity: how similar two or more documents in this collection are to each other.
Cosine Similarity
Cosine similarity is a measure that quantifies the similarity between two vectors: it is the cosine of the angle between them. The vectors are typically non-zero and lie in an inner product space.
Mathematically, the cosine similarity is the dot product of the two vectors divided by the product of their Euclidean norms (magnitudes).
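In symbols, for two document vectors A and B with n components, this reads:

```latex
\cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|}
             = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\;\sqrt{\sum_{i=1}^{n} B_i^2}}
```

A value of 1 means the vectors point in the same direction (very similar documents), while a value of 0 means they are orthogonal and share no weighted terms.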
The first step is preprocessing the data in the 8 documents (a small code sketch of these steps follows the list below).
- Removal of stop words. These are the most common words in any language (articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text.
- Removal of numbers and special characters. (Alternatively, you can replace the numbers with a token such as ‘num’ and continue.)
- Conversion of all letters in the documents to lowercase.
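A minimal sketch of these preprocessing steps in Python (the regular expression and the NLTK stop-word list are my own choices here, not the only way to do it):

```python
import re

from nltk.corpus import stopwords  # requires nltk.download("stopwords") once

STOP_WORDS = set(stopwords.words("english"))

def preprocess(text: str) -> str:
    """Lowercase the text, strip numbers/special characters, and drop stop words."""
    text = text.lower()                    # convert all letters to lowercase
    text = re.sub(r"[^a-z\s]", " ", text)  # remove numbers and special characters
    tokens = [w for w in text.split() if w not in STOP_WORDS]  # remove stop words
    return " ".join(tokens)
```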
TfidfVectorizer
TfidfVectorizer transforms text into feature vectors that can be used as input to an estimator. Its vocabulary_ attribute is a dictionary that maps each token (word) to a feature index in the matrix; each unique token gets its own feature index.
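For example, fitting the vectorizer on a toy two-sentence corpus (my own example, not one of the eight articles) shows this mapping:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["the cat sat", "the dog barked"]
vectorizer = TfidfVectorizer()
vectorizer.fit(toy_corpus)
print(vectorizer.vocabulary_)
# token -> column index (key order may vary):
# {'barked': 0, 'cat': 1, 'dog': 2, 'sat': 3, 'the': 4}
```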
Now, we can define the documents in the following way.
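The original post lists the eight articles inline; as a stand-in here, assume they are stored in plain-text files article1.txt through article8.txt (hypothetical file names) and are read and cleaned with the preprocess() sketch from above:

```python
# Read and preprocess the 8 news articles (file names are hypothetical placeholders)
documents = []
for i in range(1, 9):
    with open(f"article{i}.txt", encoding="utf-8") as f:
        documents.append(preprocess(f.read()))
```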
Next, convert the collection of raw documents to a matrix of TF-IDF features.
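A minimal sketch of this step with scikit-learn, followed by pairwise cosine similarity between all eight documents (variable names are mine):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Convert the collection of raw documents to a matrix of TF-IDF features
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)  # shape: (8, vocabulary size)

# Cosine similarity between every pair of documents
similarity = cosine_similarity(tfidf_matrix)        # shape: (8, 8)
print(similarity.round(2))
```

The (i, j) entry of the resulting matrix is the cosine similarity between article i+1 and article j+1, which is what the observations below are based on.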
According to cosine similarity, article 1 is most similar to article 8; articles 2, 3, and 7 are most similar to each other; and articles 4, 5, and 6 are most similar to each other.
Let’s meet with another interesting topic later!