How Bag of Words (BOW) Works in NLP

In this article, we are going to learn about one of the most popular concepts in NLP, bag of words (BOW), which helps convert text data into meaningful numerical data.

After converting the text data to numerical data, we can build machine learning or natural language processing models to get key insights from the text data.

Before that, let’s take a step back and understand why NLP and NLU (Natural Language Understanding) are challenging compared to other machine learning or deep learning tasks.

According to a report, there are over 7,100 languages across the globe, yet just 23 languages account for more than half the world’s population.

People communicate in the form of text, which is a combination of words. The volume of text content being produced, such as reviews, is so large that extracting insights from it has become a necessity.

Natural Language Processing, abbreviated as NLP, helps us understand raw text and extract key insights from it.

It is unique because it processes unstructured data, which is highly rich in information and can be used for different purposes. 

I hope you’re excited to learn about BOW and NLP. Before we dive into the details, let’s start with a quick overview of NLP.

NLP (Natural Language Processing) is studied as a subset of artificial intelligence. This sub-branch of artificial intelligence focuses on extracting key insights from text data.

Let’s see some of the key applications for natural language processing.

Applications of NLP

Following are the applications.

Survey/Sentiment Analysis

In survey analysis, the sentiment of customer/user feedback is analyzed. It helps the organization understand how customers/users feel about its products or services.

This sentiment analysis approach saves not only a lot of resources but also time. It can be implemented with many NLP techniques, such as Bag of Words, TF-IDF, or even neural networks.

However, the most promising results are achieved using BERT and other Transformer-based models.

Language translators like Google Translate

Language translation uses the concept of sequence models such as RNNs, and LSTMs in particular work really well here. Let us say you want to translate a sentence from English to French.

The moment you change a word that carries gender information, the associated pronoun changes automatically. This is because of the NLP-based RNNs used for the translation.

Autocorrect and Autocomplete text recommendations

The words that you have been using the most are recorded along with the order in which they are used. 

And when the system detects that an order is being repeated, it starts suggesting words and auto-completing the sentences. This not only saves typing effort but also gives a great user experience.

One real-life example: these days, Gmail gives suggestions based on the words you write in an email.

Fake News Analyser

These kinds of applications help in categorizing text.

Such models are trained on examples consisting of the text and a label indicating whether it is fake or not.

A model trained on this data can then classify whether a piece of news or an email is fake. This approach is widely used, for example, on Twitter and in the Google News section.

One real-life example is email spam classification.

For any natural language processing model, the word corpus is a key thing. Let’s discuss that a bit more.

What Is Corpus?

NLP deals with data in the form of text, and the text-based dataset we extract or receive is unstructured. We call it a corpus, which can be understood as a collection of texts.

The plural of corpus is corpora, derived from the Latin word for “body.” When the corpus is labeled and structured properly, we call it a labeled corpus.

Text Preprocessing Techniques

Below are the basic text preprocessing techniques, which we need to perform on the raw text before building any NLP model.

  1. Tokenization
  2. Stop words
  3. Stemming
  4. Lemmatization

Tokenization

It is a fundamental concept which deals with breaking the text or corpus into phrases, sentences, or words, and then storing them in a list.

For Example

Let our corpus be:

“This is a detailed article on Bag of Words with NLP. It is a beginner-friendly article!”

Sentence tokenization of the above text would give:

  • Sentence 1: This is a detailed article on Bag of Words with NLP
  • Sentence 2: It is a beginner-friendly article!
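As a sketch, the tokenization above can be reproduced with a few lines of plain Python. This is only an illustration of the idea; real pipelines typically use NLTK’s `sent_tokenize` and `word_tokenize`, which handle abbreviations and punctuation far more robustly:

```python
import re

corpus = ("This is a detailed article on Bag of Words with NLP. "
          "It is a beginner-friendly article!")

# Naive sentence tokenization: split after sentence-ending punctuation.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", corpus) if s]
print(sentences)

# Naive word tokenization: pull out lowercase word tokens.
words = re.findall(r"[a-zA-Z-]+", corpus.lower())
print(words)
```

Each sentence then becomes an item in a list, ready for the next preprocessing step.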

Stop Words

When dealing with text, the complexity of processing it is roughly proportional to the number of words it contains. So we want to bring the complexity down to the bare minimum. We cannot simply remove words at random, though.

However, we can safely discard the redundant ones which don’t really add meaning to the corpus. Such words are called stop words, which can be understood as a list of words that are supposed to be ignored.

NLTK provides stop word lists for more than 30 languages. For English, it removes words like “a,” “an,” “the,” “to,” “for,” etc. from the text.
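A minimal sketch of stop word removal, using a small hand-picked stop word set for illustration (in practice you would load NLTK’s list via `stopwords.words("english")`):

```python
# A tiny hand-picked stop word set for illustration; NLTK's English
# list contains well over a hundred such words.
stop_words = {"this", "is", "a", "an", "the", "to", "for", "it", "on", "of", "with"}

tokens = ["this", "is", "a", "detailed", "article", "on", "bag",
          "of", "words", "with", "nlp"]

# Keep only the tokens that carry meaning.
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['detailed', 'article', 'bag', 'words', 'nlp']
```

Notice how the vocabulary shrinks from eleven tokens to five without losing the meaning of the sentence.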

Stemming

As discussed earlier, the fewer the words, the simpler the model would be for NLP-based tasks.

So it is important for us to find out how we can shorten words so that words with almost the same meaning are not repeated in the vocabulary.

Vocabulary here means the list of distinct words. For this, we use stemming, which cuts words down to their root form.

For example

History and Historical are similar and have almost the same meaning, so the root word for both would be “histori.”

Let us take another example.

The root word of goes and going would be “go.”

Removing the suffix from the word to get the root word is called suffix stripping.

There are different stemmers provided by the NLTK library in Python, such as:

  • Porter Stemmer, 
  • Snowball Stemmer,
  • Lancaster Stemmer, etc.

While implementing things in this article, we will be using the Porter Stemmer.
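To illustrate the suffix stripping idea, here is a toy stemmer in plain Python. Note this is only a sketch of the concept; NLTK’s `PorterStemmer` applies a much richer set of rules and conditions:

```python
# A toy suffix stripper to illustrate the idea behind stemming.
# NLTK's PorterStemmer implements the full Porter algorithm instead.
SUFFIXES = ["ing", "ly", "ed", "es", "s"]

def naive_stem(word: str) -> str:
    # Strip the first matching suffix, but keep at least two characters
    # of the stem so short words are left alone.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

for w in ["going", "quickly", "played", "histories"]:
    print(w, "->", naive_stem(w))
# going -> go, quickly -> quick, played -> play, histories -> histori
```

As the last example shows, the stemmed output (“histori”) is not necessarily a valid English word, which is exactly the limitation lemmatization addresses next.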

Lemmatization

Lemmatization is another text preprocessing concept, closely related to stemming. In the stemming examples, we saw that the root words produced may not be meaningful words themselves.

Lemmatization keeps words intact by mapping them to valid dictionary words (lemmas) instead of stripping suffixes. Lemmatization is considered better in terms of preprocessing quality but consumes more time than stemming.
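A toy sketch of the idea behind lemmatization, using a tiny hand-written lemma dictionary (in practice, NLTK’s `WordNetLemmatizer` looks words up in the WordNet dictionary):

```python
# A toy lemma lookup table for illustration only; real lemmatizers
# consult a full dictionary such as WordNet.
LEMMAS = {
    "goes": "go",
    "going": "go",
    "went": "go",            # irregular forms map correctly, unlike stemming
    "histories": "history",  # a valid word, not the stem "histori"
}

def naive_lemmatize(word: str) -> str:
    # Fall back to the word itself when it is not in the table.
    return LEMMAS.get(word, word)

for w in ["went", "histories", "article"]:
    print(w, "->", naive_lemmatize(w))
```

The dictionary lookup is why lemmatization always returns real words, and also why it is slower than rule-based stemming.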

There are various other text preprocessing techniques to apply to text. You can refer to the article below to learn 20+ popular text preprocessing techniques along with their implementations.

Word Embeddings

We have understood the need to reduce the vocabulary as much as possible, and we do it with text preprocessing techniques like stemming, stop word removal, etc.

After this comes the word embedding step. It can be understood as converting words into numeric vectors so that the system can process them.

There are different word embedding techniques:

  1. Binary Encoding
  2. TF-IDF
  3. Word2vec
  4. Latent Semantic Analysis encoding

We suggest you read the below article to learn about the popular word embedding techniques along with implementation.

All of these are useful embedding techniques, but in this article, we will focus on binary encoding, which is used in the Bag of Words model. It simply marks a word as 1 if it is present in the sentence, else 0.

A detailed explanation of the same is given in the section below.

Understanding Bag of Words

As the name suggests, the concept is to create a bag of words from the clutter of words, which is also called the corpus.

It is the simplest form of representing words in the form of numbers. We convert the words to digits because the system needs the information in the form of numbers, or else it won’t be able to process the data.

We convert the words to numbers by analyzing the presence of the word in a particular sentence. 

Each word is assigned an encoded value: the number of times that word appears in the sentence.

If only the presence of a word is to be considered, the encoding is done with 1’s and 0’s: when the word is present in the sentence, it is denoted as 1, else 0. This is called a binary bag of words.

Let us understand the Bag of words better with an example.

After the text preprocessing step, we will end up with the below sentences.

  • Document 1: read SVM algorithm article dataaspirant blog
  • Document 2: read randomforest algorithm article dataaspirant blog

Here we will be making a vocabulary that will consist of all the words used in the above two sentences. 

A bag of words keeps a record of the occurrence/presence of each vocabulary word in each document.
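The two documents above can be turned into binary bag-of-words vectors with a few lines of plain Python (in practice, scikit-learn’s `CountVectorizer` with `binary=True` does this for you):

```python
# The two preprocessed documents from above.
doc1 = "read SVM algorithm article dataaspirant blog".split()
doc2 = "read randomforest algorithm article dataaspirant blog".split()

# Vocabulary: every unique word across both documents, in a fixed order.
vocab = sorted(set(doc1) | set(doc2))
print(vocab)

# Binary encoding: 1 if the word occurs in the document, else 0.
vectors = {}
for name, doc in [("Document 1", doc1), ("Document 2", doc2)]:
    vectors[name] = [1 if word in doc else 0 for word in vocab]
    print(name, vectors[name])
```

Replacing the 1/0 presence test with a word count would give the count-based bag of words described earlier. Note how the two vectors differ only at the positions for “SVM” and “randomforest,” the only words the documents do not share.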
