How TF-IDF (Term Frequency–Inverse Document Frequency) Works
For building any natural language model, the key challenge is converting text data into numerical data, since machine learning and deep learning models don’t understand raw text. One smart way to do this conversion is the TF-IDF method.
TF-IDF is a popular text vectorization technique used in various natural language processing tasks. There are other techniques for converting text data to numerical data, but in this article we focus on TF-IDF, which is short for “term frequency–inverse document frequency.”
This numerical statistic is used to figure out the relevance of a word in a document that is part of a larger collection of documents.
As the name itself suggests, it is composed of two different terms:
- TF: Measures how many times a word appears in the document.
- IDF: Represents how common the word is across the different documents.
This technique has many use-cases.
For example, TF-IDF is very popular for scoring the words in machine learning algorithms that work with textual data (for example, Natural Language Processing tasks like Email spam detection).
TF-IDF serves as a weighting factor in tasks like information retrieval and data mining. We will talk about it in more detail in the coming sections.
The target audience for this article is not limited to machine learning practitioners and researchers; it also includes people from domains like search engine optimization who wish to use TF-IDF to improve their site or blog rankings.
Despite what the name suggests, we will see that this algorithm is not complicated at all.
The reader is not expected to have any prerequisite knowledge of the topic, but basic Python knowledge will come in handy in the later sections where the topic is discussed with code implementations.
With this being said, let’s start the article. Below are the topics you are going to learn, provided you read the complete article 🙂
Common NLP Terminologies
Before going further deep into the topic, it is imperative for us to understand the fundamental terminologies used in this context.
If these fundamental concepts are clear, it will be effortless to understand how term frequency–inverse document frequency converts text data to numerical data.
What is Corpus?
For different people, the term corpus can have various meanings, but we are interested in what it means to computational linguists.
A corpus (Latin for “body”; the plural is corpora) can be thought of as a representative sample of the problem: a collection of text data. When this corpus is structured and annotated, it is more commonly known as a labeled corpus.
In the machine learning context, it refers to an extensive collection of text data from sources like books, news, webpages, etc., which forms the basis for various linguistic analysis tasks.
In short: a corpus is the collection of text we use for testing our hypotheses or for building models.
What is Term Frequency?
At heart, it is a measure of how frequently a term (word) occurs in a document. A simple way to calculate it is just to have a raw count of the number of times a certain term appeared in the document.
But some adjustment is needed to avoid biased results. Documents can vary in length, and a term is likely to appear more often in a longer document than in a shorter one.
To compensate for this, the frequency value needs to be normalized. This is done by dividing the raw count of the term by the length of the document, or by the count of the most common (frequent) term in that document.
More intuitively, it can be computed like this:
TF(‘xyz’) = (Number of times term ‘xyz’ appeared in the document) / (Length of the document).
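The formula above can be sketched as a small function. This is a minimal illustration (the sentence used in the comment is the toy document from the worked example later in the article):

```python
import re

def term_frequency(term, document):
    """Raw count of a term divided by the total number of terms in the document."""
    words = re.findall(r"\w+", document.lower())
    return words.count(term.lower()) / len(words)

# term_frequency("read", "read svm algorithm article dataaspirant blog")
# → 1/6 ≈ 0.1667
```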
What is Inverse Document Frequency?
This measure tells how rare (or common) a term is in the entire set of documents (popularly known as a Corpus).
Intuitively, the more common a term is, the less important it will be. In this case the value of the IDF will be closer to 0.
Terms like ‘is,’ ‘are,’ and ‘of’ appear very often but carry very little importance, which is not reflected properly in the TF measure.
For a rare word, it is precisely the contrary: such terms should be given more attention when measuring their relevance in the set of documents.
This measure is calculated by dividing the total number of documents by the number of documents containing that word, and then taking a logarithm, or more intuitively by estimating the following:
IDF(‘xyz’) = log_e (Number of documents in corpus / Number of documents with term ‘xyz’ in it)
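The IDF formula can likewise be sketched as a short function. The two toy documents below are the same ones used in the worked example later in the article:

```python
import math

def inverse_document_frequency(term, documents):
    """log_e of (total documents / documents containing the term)."""
    containing = sum(1 for doc in documents if term.lower() in doc.lower().split())
    return math.log(len(documents) / containing) if containing else 0.0

docs = ["read svm algorithm article dataaspirant blog",
        "read randomforest algorithm article dataaspirant blog"]

# inverse_document_frequency("read", docs) → log(2/2) = 0.0
# inverse_document_frequency("svm", docs)  → log(2/1) ≈ 0.693
```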
Thus, to weigh down the frequent terms and scale up the rare ones, an IDF factor is incorporated to distinguish relevant from non-relevant terms as well as documents.
TF-IDF (Term Frequency – Inverse Document Frequency)
Subsequently, the TF-IDF value can be calculated by taking a product of the two statistics: TF and IDF. Intuitively, this means:
TF-IDF (‘xyz’) = TF(‘xyz’) * IDF(‘xyz’)
A term will be considered highly relevant in a document and assigned a higher weight when it has a high TF value (in that document) and a low document frequency in the whole corpus.
The IDF value in that situation will be greater than 0 (since the logarithm of a ratio greater than 1 is positive), and thus the overall TF-IDF value will be large.
However, when a term is more common in the corpus, the TF-IDF value drops down closer to 0.
Note: While we only discussed the simplest way of calculating the TF and IDF values, there exist a number of different ways to calculate them, which are beyond the scope of this introductory article.
We have learned everything we need to compute TF-IDF, so let’s take a toy example and perform the calculation.
First, we will see how to calculate the TF-IDF values in Excel. Next, we will do the same with simple Python code.
TF-IDF Calculation in Excel
For this purpose, we are going to take two documents, each containing one sentence. Considering this as input text, we will calculate the TF-IDF values.
The two documents contain the sentences below.
- Document 1: I read the SVM algorithm article in dataaspirant blog
- Document 2: I read the randomforest algorithm article in dataaspirant blog
After the text preprocessing step, we will end up with the below sentences.
- Document 1: read SVM algorithm article dataaspirant blog
- Document 2: read randomforest algorithm article dataaspirant blog
- After preprocessing, we list all the unique words for performing the TF-IDF calculation.
- As a first step, we count the number of times each word appears in the documents.
- For example, the word read appears once in document 1 and once in document 2.
- In the second step, we calculate the TF (term frequency).
- For example, for the word read, TF is 0.17, which is 1 (word count) / 6 (number of words in document 1).
- In the third step, we calculate the IDF (inverse document frequency).
- For example, for the word read, IDF is 0, which is log(2 (number of documents) / 2 (number of documents in which the word read is present)).
- In the fourth step, we calculate TF * IDF.
If you look closely, the key difference between these two documents is the algorithm they talk about.
This comes up clearly in the TF-IDF calculation: all the other words get 0, while the words SVM and randomforest get a non-zero value.
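The Excel calculation above can be reproduced in a few lines of Python. This is a minimal sketch using the simple TF and IDF definitions from earlier (no library functions):

```python
import math

docs = {
    "doc1": "read svm algorithm article dataaspirant blog".split(),
    "doc2": "read randomforest algorithm article dataaspirant blog".split(),
}

def tf(term, words):
    # Raw count normalized by document length.
    return words.count(term) / len(words)

def idf(term):
    # log_e(total documents / documents containing the term).
    containing = sum(1 for words in docs.values() if term in words)
    return math.log(len(docs) / containing)

for name, words in docs.items():
    scores = {t: round(tf(t, words) * idf(t), 3) for t in words}
    print(name, scores)
# Shared words all score 0.0; "svm" (doc1) and "randomforest" (doc2)
# each score (1/6) * log(2) ≈ 0.116.
```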
TF-IDF Calculation With Simple Python Code
For this purpose, we are going to take three documents, each containing four sentences. Considering this as input text, we will calculate the TF-IDF values.
You can scroll the above notebook to check each step you need to calculate the TF-IDF value.
Using TF-IDF in Machine Learning Models
In a very general sense, machine learning models are very powerful and always ready to be fed some data to bring out some useful and sensible outcomes.
However, a machine learning model is only as good as the type and quality of data it sees during training (and at other stages). It is, thus, our duty to make sure that the data is clean, well processed, and in a format which the model can understand easily.
The biggest hurdle with Natural language processing (NLP) is that the algorithms deal with numbers, while the data we have in NLP is text.
So, after performing the various text preprocessing steps, like
- stop-words removal, etc.,
it is important that we convert the resulting text data into numbers (a feature vector) that the model can understand.
For this, we have various word embedding techniques, but in this section we will see how it can be done with the TF-IDF model.
The TF-IDF model scans through the documents one by one and creates word vectors by assigning a weight (relevance score) to each term, penalizing the terms that are very frequent in each and every document with a lower score.
Once each word in the document is associated with a number, the model can be fed this word vector, and one can expect it to deliver the desired results.
A few popular machine learning models used for smaller text-related tasks are the Naive Bayes algorithm and simple logistic regression models.
To give another example, one popular application of natural language processing and deep learning algorithms is text generation: given some sample words, we generate text by predicting the next word. For this, we use an RNN deep learning model.
A real-life application of TF-IDF
Being able to ascertain the relevance of a term in a document using TF-IDF can be useful in settings like information retrieval, data mining, search engines, etc.
Let’s say we have a company that deals with hotel room images and saves all its information in a database.
While serving a user who wants to look up rooms with “striped wooden flooring,” the search engine has to decide which images to return when none of them match the exact query.
To make it more intuitive, assume we have to select between two partial matches:
- Dark wooden flooring
- Striped hardwood flooring
Both partial matches share two of the three query words, so we have to understand how TF-IDF can help us choose the second option (which, as we can see, is the better match).
The term frequency will be the same for each word, but the inverse document frequency will make all the difference.
The term “flooring” will have a low IDF value, since it appears in most of the hotel room listings, while the term “striped” will have a larger IDF, since it appears in only a few listings and thus has more discriminative power.
So, we return the images for the 2nd partial match.
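This ranking intuition can be sketched in a few lines. The listing corpus below is invented for illustration, and the second candidate caption is taken as “striped hardwood flooring” so that the rare query term actually appears in it:

```python
import math

# Toy corpus of hotel listing captions (invented for illustration).
corpus = [
    "dark wooden flooring in every room",
    "polished wooden flooring",
    "classic wooden flooring style",
    "striped hardwood flooring",
    "grey hardwood flooring",
    "marble flooring suite",
]

def idf(term):
    df = sum(1 for caption in corpus if term in caption.split())
    return math.log(len(corpus) / df) if df else 0.0

def score(query, caption):
    # Sum the IDF of every query term the caption contains
    # (term frequencies are all 1 here, so TF drops out).
    words = caption.split()
    return sum(idf(t) for t in query.split() if t in words)

query = "striped wooden flooring"
print(score(query, "dark wooden flooring"))       # only "wooden" carries weight
print(score(query, "striped hardwood flooring"))  # the rare "striped" dominates
```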
I hope this gives a clear understanding. Now let’s get to the fun part.
Implementing TF-IDF With Python Code
Having theoretical knowledge is one thing, but it is even better if one is also able to apply their understanding by implementing the same in code.
Let’s look briefly at a Python implementation of a TF-IDF class and try to understand the code, one block at a time.
Initialize the variables:
- the number of documents,
- the list of stopwords,
- a dictionary storing, for each term, the number of documents it is present in, etc.
If a corpus file or a stopword file is supplied, call the respective function to absorb this information.
For downloading the corpus file you can use the below link.
The corpus file needs to have a specific format: the first line gives the number of documents, while the following lines give the number of documents each term is present in.
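As an illustration, a hypothetical corpus file following this format might look like the snippet below (the terms and counts here are invented):

```text
2
algorithm 2
article 2
svm 1
randomforest 1
```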
For downloading the stopword file you can use the below link.
The stopword file should have one stopword per line.
Use this function to add a corpus to the existing corpus within the TF-IDF class. First, parse the corpus file with the right settings (mode and encoding).
Read the first line to fetch the number of documents and add it to the existing value num_docs.
Read the subsequent lines to fetch each token name and its frequency value, and add this information to the respective variables.
If the token is already known, add the frequency value to the previously stored value; otherwise, create a new entry in the dictionary object.
Add the data from one new document to the existing information maintained by the class object. Increment the variable num_docs by 1, and update the frequency values for the individual tokens.
This function saves the information maintained in the class object to a separate dump file, in a format similar to the one discussed above for the corpus and stopwords files.
First, open the corpus file in write mode, write the num_docs value, and then write one token at a time with its frequency value.
Next, open the stopwords file, and, using a threshold value, identify how many (and which) tokens qualify as potential stopwords. Write those tokens to the stopwords file.
Determine the value of IDF for a specific token (or term) by using this function. The function first checks if the term is present in the list of tokens flagged as stopwords by the class object and, if so, returns 0.
Next, it checks and returns the frequency value corresponding to that token in the dictionary object term_num_docs.
If the token is not present there, a default constant IDF value is returned.
Finally, use the get_doc_keywords function to calculate the TF-IDF value for the given document. We start with identifying the unique set of tokens from the given document.
For this token set, we calculate the TF and IDF values and then multiply the two to get the TF-IDF value. The results are returned ordered from the most to the least relevant token.
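Since the implementation itself is not reproduced here, below is a minimal sketch consistent with the description above. The names num_docs, term_num_docs, get_idf, and get_doc_keywords follow the text; the default IDF constant of 1.5 and the other details are assumptions, and the exact original code may differ:

```python
import math

class TfIdf:
    def __init__(self, stopwords=None, default_idf=1.5):
        self.num_docs = 0
        self.stopwords = set(stopwords or [])
        self.term_num_docs = {}          # term -> number of documents containing it
        self.default_idf = default_idf   # assumed fallback for unseen terms

    def add_input_document(self, text):
        """Update the document count and per-term document frequencies."""
        self.num_docs += 1
        for term in set(text.lower().split()):
            self.term_num_docs[term] = self.term_num_docs.get(term, 0) + 1

    def get_idf(self, term):
        if term in self.stopwords:
            return 0.0                   # stopwords are flagged as irrelevant
        if term not in self.term_num_docs:
            return self.default_idf      # constant IDF for unknown tokens
        return math.log(self.num_docs / self.term_num_docs[term])

    def get_doc_keywords(self, document):
        """Return (token, tf-idf) pairs sorted from most to least relevant."""
        tokens = document.lower().split()
        scores = {}
        for token in set(tokens):
            tf = tokens.count(token) / len(tokens)
            scores[token] = tf * self.get_idf(token)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

tfidf = TfIdf(stopwords=["the"])
tfidf.add_input_document("read svm algorithm article")
tfidf.add_input_document("read randomforest algorithm article")
print(tfidf.get_doc_keywords("read the svm algorithm article"))
```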
Let’s run a small test script to verify the functionality.
Initially, we are creating an object of the class TfIdf. Further, we read the corpus and stopwords file (Observe in the output of how the data is actually stored).
Lastly, we calculate the TF-IDF values for every word in the corpus. Since we are working with only a single file in this example, the TF value for every word is the same (1.0), which can be seen in more detail in the excerpt below.
Also, the words marked as stopwords will have a TF-IDF value of 0.
Below is the output.
While it is actively used in NLP-related problems, TF-IDF can also prove to be a game-changer in establishing a refined content publishing strategy and further improving search engine optimization (SEO).
Running a TF-IDF analysis on your own website (or on a competitor’s website) might help discover some gaps which should be immediately fixed or discover some concepts which were never explored before.
For the web pages bringing the most traffic, it is important to know if they are the best version of themselves, and then work on further improving the content.
We know for a fact that machine learning algorithms work better with numerical data, and the TF-IDF algorithm acts as a bridge to connect such algorithms with the text data.
Text vectorization has played a very important part in NLP areas like
- sentiment analysis,
- information retrieval,
- text analysis,
- search engine optimization,
- text summarization,
- ranking the results to a query, etc.
Having a good understanding of how TF-IDF works will also help gain clarity on how various machine learning algorithms function.
While we looked at one of the simplest ways of working with TF-IDF by summing the tf-idf for each query term, many more sophisticated ranking functions are variants of this simple model.
Recommended Natural Language Processing Courses
NLP Specialization with Python
NLP Classification and vector spaces
NLP Model Building With Python
About the Author: Nikhil Kumar
Nikhil is a creative and product-focused Data Scientist who likes to work on the most challenging as well as productive machine learning, and big data use cases.
He has extensive experience working with real-world challenges such as anomaly/intrusion detection, dynamic pricing, predicting customer intents, ranking search results, time-series classification, deriving insights from unstructured data such as images and text, making personalized recommendations, and innovating the user experience.
Previously, he did his Master of Science in Computer Engineering with a specialization in Signal, Image, and Speech Processing at the University of Paderborn (Germany).