Document Similarity Using Cosine Similarity

Sandani Sesanika Fernando
3 min read · Sep 3, 2022

Here I consider a set of 8 text documents: news articles covering three different topics, namely Hurricane Gilbert Heads Toward Dominican Coast, IRA terrorist attack, and McDonald’s Opens First Restaurant in China.

The goal is to determine document similarity: how similar two or more documents in this collection are to each other.

Cosine Similarity

Cosine similarity is a measure that quantifies how similar two vectors are: it is the cosine of the angle between them. The vectors must be non-zero and belong to an inner product space. For a collection of documents, the measure is computed for each pair of document vectors.

Mathematically, the cosine similarity is the dot product of the two vectors divided by the product of their Euclidean norms (magnitudes).
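Written out for two vectors, that definition looks as follows. The symbols A, B, and n (the number of dimensions) are my own notation here, chosen only for illustration.

```latex
\[
\cos(\theta)
  = \frac{\mathbf{A} \cdot \mathbf{B}}{\lVert \mathbf{A} \rVert \, \lVert \mathbf{B} \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
\]
```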

Cosine Similarity
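To make the definition concrete, here is a minimal sketch that computes the dot product divided by the product of the norms, using only the standard library. The function name and the toy vectors are assumptions for illustration, not part of the original analysis.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))        # dot product
    norm_a = math.sqrt(sum(x * x for x in a))     # Euclidean norm of a
    norm_b = math.sqrt(sum(y * y for y in b))     # Euclidean norm of b
    return dot / (norm_a * norm_b)

# Toy vectors (made up): the first pair points in the same direction,
# the second pair is orthogonal.
print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # close to 1
print(cosine_similarity([1, 0], [0, 1]))        # 0.0
```

For vectors with non-negative entries, such as word counts or TF-IDF weights, the value falls between 0 (no shared terms) and 1 (same direction).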

The first step is to preprocess the text of the 8 documents:

  1. Remove stop words. (These are the most common words in a language, such as articles, prepositions, pronouns, and conjunctions, and they add little information to the text.)
  2. Remove numbers and special characters. (Alternatively, you can replace each number with the token ‘num’ and continue.)
  3. Convert all letters in the documents to lowercase.
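The three steps above can be sketched roughly as follows. This is not the exact code used for the articles: the tiny stop-word set, the function name, and the sample sentence are assumptions for illustration, and a real run would use a fuller stop-word list. Lowercasing is applied first so that the stop-word check is case-insensitive.

```python
import re

# Illustrative stop-word set (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "in", "on", "of", "to", "and", "is", "it", "at"}

def preprocess(text):
    text = text.lower()                      # step 3: lowercase everything
    text = re.sub(r"[^a-z\s]", " ", text)    # step 2: drop numbers and special characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # step 1: drop stop words
    return " ".join(tokens)

# Made-up sample sentence, not taken from the article set.
print(preprocess("Hurricane Gilbert heads toward the Dominican coast, 2 days away!"))
```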

TfidfVectorizer

TfidfVectorizer transforms text into feature vectors that can be used as input to an estimator. Its vocabulary_ attribute is a dictionary that maps each token (word) to a feature index in the matrix; each unique token gets its own feature index.
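A small sketch of what vocabulary_ looks like, assuming scikit-learn is installed. The three short strings are placeholders, not the news articles.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder documents (made up for illustration).
docs = [
    "hurricane heads toward coast",
    "restaurant opens china",
    "hurricane hits coast",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # sparse matrix: one row per document
vocab = vectorizer.vocabulary_       # token -> column index

# Print the tokens in column order.
for token, index in sorted(vocab.items(), key=lambda kv: kv[1]):
    print(index, token)
print(X.shape)
```

Each column of X corresponds to one entry of vocabulary_, so the number of columns equals the number of unique tokens.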

Importing the Libraries
Uploading Files

Now, we can define the documents in the following way.

Reading the files
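One possible way to read the files is sketched below, assuming each article is stored as its own .txt file in a single folder. The helper name read_documents is hypothetical, and the demo writes two placeholder files to a temporary folder so that the snippet runs on its own.

```python
import tempfile
from pathlib import Path

def read_documents(folder):
    """Return the text of every .txt file in `folder`, ordered by file name."""
    paths = sorted(Path(folder).glob("*.txt"))
    return [p.read_text(encoding="utf-8") for p in paths]

# Demo with a throwaway folder and placeholder text (not the real articles).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "doc1.txt").write_text("first placeholder article", encoding="utf-8")
    Path(tmp, "doc2.txt").write_text("second placeholder article", encoding="utf-8")
    documents = read_documents(tmp)

print(documents)
```

Sorting by file name keeps the order of the list stable, which matters later when rows of the similarity matrix are matched back to article numbers.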

Next, convert the collection of raw documents to a matrix of TF-IDF features.
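A minimal sketch of this step, assuming scikit-learn and NumPy are installed. The documents list holds placeholder sentences, and stop_words="english" is one possible setting rather than necessarily the one used here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder documents (made up); in practice this is the list of article texts.
documents = [
    "hurricane gilbert heads toward the dominican coast",
    "the storm moves along the coast",
    "mcdonald's opens a restaurant in china",
]

# Lowercasing is on by default; stop_words="english" drops common English words.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(documents)

print(tfidf_matrix.shape)  # (number of documents, number of unique tokens)

# With the default norm="l2", every row is scaled to unit length.
row_norms = np.sqrt(np.asarray(tfidf_matrix.multiply(tfidf_matrix).sum(axis=1))).ravel()
print(row_norms)
```

Because the rows have unit length, the cosine similarity between two documents reduces to the dot product of their rows.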

Cosine Similarities
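The pairwise similarities can be computed along these lines, assuming scikit-learn is installed. The three sentences are placeholders, so the numbers printed are not the results for the 8 articles.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder documents (made up for illustration).
documents = [
    "hurricane gilbert heads toward the dominican coast",
    "the storm moves along the coast",
    "mcdonald's opens a restaurant in china",
]

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(documents)

# Entry [i, j] is the cosine similarity between document i and document j.
similarities = cosine_similarity(tfidf_matrix)
print(similarities.round(2))
```

The diagonal is 1 because every document is identical to itself. Off-diagonal entries are higher for documents that share weighted terms, such as the first two placeholders, which both mention "coast".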

According to the cosine similarities, article 1 is most similar to article 8. Articles 2, 3, and 7 are most similar to each other, and articles 4, 5, and 6 are most similar to each other.

See you next time with another interesting topic!
