20+ Popular NLP Text Preprocessing Techniques Implementation In Python


NLP Text Preprocessing Techniques

Using text preprocessing techniques, we can remove noise from raw data and make it more valuable for building models.

Here, raw data is simply the data we collect from different sources such as website reviews, documents, social media, tweets, news articles, etc.

Data preprocessing is the first and most crucial step in any data science problem or project. Preprocessing the collected data is an integral part of any Natural Language Processing, Computer Vision, deep learning, or machine learning problem. Based on the type of dataset, we have to follow different preprocessing methods.

This means machine learning data preprocessing techniques differ from deep learning or natural language (NLP) data preprocessing techniques.

So we need to learn these techniques to build effective natural language processing models.

In this article, we will discuss different text preprocessing techniques such as normalization, stemming, and lemmatization for handling text when building Natural Language Processing models.


Moreover, we won’t limit ourselves to the theory; we will also implement these techniques in Python.


Text Preprocessing Importance in NLP

As we said before, text preprocessing is the first step in the Natural Language Processing pipeline. Its importance in NLP is growing because the data extracted or collected from different sources is often noisy or unclear.

Most text data is collected from reviews on e-commerce websites like Amazon or Flipkart, tweets from Twitter, comments from Facebook or Instagram, and other websites like Wikipedia.

Users often use short forms, emojis, misspelled words, etc. in their comments, tweets, and so on.

We should not feed raw data to a model without preprocessing, because text preprocessing usually improves the model’s performance.

If we feed data without applying any text preprocessing techniques, the models we build will not learn the real significance of the data. In some cases, the models will get confused and give random results.

In that confusion, the model will learn harmful patterns that are not valuable, and its performance will drop significantly.

So we should remove this noise from the text and bring it into a clearer, more structured form for building models.

Here we have to know one thing.

Natural language text preprocessing techniques vary from problem to problem. This means we cannot blindly reuse the preprocessing techniques chosen for one NLP problem on another NLP problem.

For example, in sentiment analysis classification problems, we can usually remove or ignore numbers within the text because numbers are not significant for that problem statement.

However, we should not ignore numbers if we are dealing with finance-related problems, because numbers play a key role in these kinds of problems.

So while applying NLP text preprocessing techniques, we need to focus on the domain in which we are applying them. The order of the techniques also plays a key role.

Don’t worry about the order of these techniques for now. We will give the generic order in which you need to apply them.

Our suggestion is to first try the preprocessing techniques on a small subset of the data (take a few sentences randomly). We can then easily check whether the output is in the expected form. If it is, apply the techniques to the complete dataset; otherwise, change the order of the preprocessing techniques.

We will provide a Python file with a preprocess class containing all the preprocessing techniques at the end of this article.

You can download that file and import the class into your code. We get the preprocessed text by calling the preprocess class with a list of sentences and the sequence of preprocessing techniques we need to use.

Again, the order of the techniques we need to use will differ from problem to problem.

Different Text Preprocessing Techniques

Let us jump to learn different types of text preprocessing techniques. 

In the next few minutes, we will discuss and learn the importance and implementation of these techniques.

Converting to Lower case

Converting all our text to lower case is a simple and very effective approach. If we do not apply lower case conversion to words like NLP, nlp, and Nlp, we treat all of them as different words.

After lower casing, all three are treated as the single word nlp.

Converting to Lower Case


This method is useful for problems that depend on the frequency of words, such as document classification.

In this case, we count the frequency of words by using bag-of-words, TF-IDF, etc.

It is better to lower case the text as the first step of text preprocessing, because stop word lists are usually in lower case, so all words need to be in lower case before we try to remove stop words.

For example, a few sentences start with the word “The”. If we do not lower case the text before stop word removal, we cannot remove all stop words.

The other case is calculating the frequency count. If we do not convert the text to lower case, Data Science and data science will be treated as different tokens.

In natural language processing, the smaller units of text, such as words, are called tokens.

We can apply this method to most text-related problems. Still, it may not be suitable for projects like Parts-Of-Speech tagging or dependency parsing, where proper word casing is essential to recognize nouns, verbs, etc.

Implementation of lower case conversion
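As a minimal sketch of this step, the built-in str.lower method is enough. The function name lower_casing_text and the example sentence are our own choices for illustration.

```python
def lower_casing_text(text):
    """Return the text with every character converted to lower case."""
    return text.lower()


example = "NLP, nlp and Nlp become the SAME token after lower casing"
lower_result = lower_casing_text(example)
print(lower_result)
```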

Removal of HTML tags

Removing html tags


This is the second essential preprocessing technique. HTML tags are quite common in our text data when we extract or scrape it from different websites.

We don’t get any valuable information from these HTML tags, so it is better to remove them from our text data. We can remove these tags by using regex, or we can use the BeautifulSoup class from the bs4 library.

Let us see the implementation using Python.

HTML tags removal Implementation using regex module
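A minimal regex-based sketch could look like the following. The pattern and the function name remove_html_tags are assumptions for illustration; the pattern simply drops anything between < and >, so it is not a full HTML parser and can misfire on text that contains a literal < character.

```python
import re


def remove_html_tags(text):
    """Remove anything that looks like an HTML tag, e.g. <p> or </b>."""
    html_pattern = r"<[^>]+>"
    return re.sub(html_pattern, "", text)


example = "<p>Text <b>preprocessing</b> is fun</p>"
html_result = remove_html_tags(example)
print(html_result)
```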

Html Tag removal Example


Implementation of Removing HTML tags using bs4 library

We can observe that both functions give the same result after removing HTML tags from our example text.

Removal of URLs

Remove Urls


URL is the short form of Uniform Resource Locator. The URLs within the text refer to the location of another website or some other resource.

If we are performing website backlink analysis, or a similar analysis on Twitter or Facebook data, URLs are worth keeping in the text.

Otherwise, URLs usually give us no useful information, so we can remove them from our text. We can remove URLs from the text by using the Python re (regex) library.

Urls removal Example


Implementation of Removing URLs  using python regex

In the script below, we take example text containing URLs and call the function with that example text. In the remove_urls function, we assign a regular expression that matches URLs to url_pattern and then replace every URL in the text with a space by calling the re library’s sub function.
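The description above can be sketched as follows. The exact regular expression is an assumption: it matches tokens that start with http://, https://, or www. and run up to the next whitespace, which is a common heuristic and not a complete URL grammar.

```python
import re


def remove_urls(text):
    """Replace every URL-like token in the text with a single space."""
    url_pattern = r"https?://\S+|www\.\S+"
    return re.sub(url_pattern, " ", text)


example = "Read more at https://example.com/nlp or www.example.org today"
urls_result = remove_urls(example)
print(urls_result)
```

Because each URL is replaced by a space, the result can contain runs of spaces; the extra whitespace removal step later in the article cleans those up.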

Removing Numbers

We can remove numbers from the text if our problem statement doesn’t require numbers.

However, if we are working on finance-related problems, such as those in the banking or insurance sectors, the numbers may carry information.

In those cases, we shouldn’t remove numbers.

Removing Numbers


Implementation of Removing numbers  using python regex

In the code below, we will call the remove_numbers function with example text, which contains numbers.

Let’s see how to implement it.
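A minimal sketch, assuming we treat any run of digits as a number. The function name remove_numbers and the variable numbers_result follow the names used in this section; the pattern and the example text are our own.

```python
import re


def remove_numbers(text):
    """Replace every run of digits in the text with a single space."""
    number_pattern = r"\d+"
    return re.sub(number_pattern, " ", text)


example = "You owe 1228 rupees to 3 people"
numbers_result = remove_numbers(example)
print(numbers_result)
```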

In the above remove_numbers function, we define a pattern to recognize numbers within the text and then substitute the numbers with a space using the re library’s sub function.

The function returns the text without numbers, which we store in the numbers_result variable.

Converting numbers to words

If our problem statement needs the valuable information carried by numbers, we can convert the numbers to words instead. This applies to the same kind of problem statements discussed in the removing numbers section above.

Converting Numbers to Words


Implementation of Converting numbers to words using python num2words library

We can convert numbers to words by just importing the num2words library. In the code below, we will call the num_to_words function with example text. Example text has numbers.

In the above code, the num_to_words function gets the text as input. Inside it, we split the text on spaces with Python’s string split function to get the individual words.

We take each word and check whether it is a digit or not. If the word is a digit, we convert it into words.

Apply spelling correction

Spelling Checking


Spelling correction is another important preprocessing technique when working with tweets, comments, etc., because misspelled words are common in those kinds of text. We need to replace those misspelled words with correctly spelled ones.

We can check and replace misspelled words by using two Python libraries: one is pyspellchecker, and the other is autocorrect.

Example of Spelling Correction


Implementation of spelling correction using python pyspellchecker library

Below, we call a spell_correction function with example text. The example text contains misspelled words so that we can check whether the spell_correction function returns the correct words.

Implementation of spelling correction using python autocorrect library

We can observe that both methods give the correct, expected results.

Convert accented characters to ASCII characters

This is another common preprocessing technique in NLP. Accented characters are common letters with a special mark on top, for example, résumé. On many keyboards, we get them by long-pressing a key while typing.

If we do not remove this type of noise from the text, the model will treat resume and résumé as two different words, even though both are the same.

We can convert these accented characters to ASCII characters by using the unidecode library.

Convert accented characters to ASCII


Implementation of accented text to ASCII converter in python

In the script below, we define the accented_to_ascii function to convert accented characters to their ASCII equivalents.

We will call this function with example text.

In the above code, we pass the input text to the unidecode function of the unidecode library, which converts accented characters to their ASCII equivalents.
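If installing unidecode is not an option, a rough standard-library approximation is possible with the unicodedata module. Note that this is a swapped-in technique, not the unidecode API: it decomposes each accented letter into a base letter plus a combining mark and then drops everything that is not ASCII, so characters without such a decomposition (for example ß) are simply removed instead of transliterated.

```python
import unicodedata


def accented_to_ascii(text):
    """Approximate ASCII folding using only the standard library."""
    # NFKD splits "é" into the base letter "e" plus a combining accent mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" drops the combining marks (and any
    # other non-ASCII character that has no ASCII decomposition).
    return decomposed.encode("ascii", "ignore").decode("ascii")


example = "My résumé mentions a café in Málaga"
ascii_result = accented_to_ascii(example)
print(ascii_result)
```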

Converting chat conversation words to normal words

This is another essential preprocessing technique if we work with chat conversations, or if our problem statement requires chat conversation analysis. We need to handle short forms, because nowadays people use short-form words in their chat conversations for simplicity.

Chat conversation to normal words

A better way to work with those words is to replace the short-form words with their original words.

We can find these short-form words and their full forms in this Github Repo. To save that file onto our system, right-click on it and then choose the save as option.

Implementation of python script
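A minimal sketch of the idea, assuming the short forms are available as a dictionary. The three CHAT_WORDS entries and the function name chat_words_to_normal are hypothetical samples for illustration; in practice the mapping would be loaded from the downloaded file.

```python
# Hypothetical sample mapping; a real one would be loaded from a file.
CHAT_WORDS = {
    "BRB": "be right back",
    "IMO": "in my opinion",
    "ASAP": "as soon as possible",
}


def chat_words_to_normal(text):
    """Replace known chat short forms with their full forms."""
    words = []
    for word in text.split():
        # Look the token up case-insensitively; keep it unchanged if unknown.
        words.append(CHAT_WORDS.get(word.upper(), word))
    return " ".join(words)


example = "imo this is great brb"
chat_result = chat_words_to_normal(example)
print(chat_result)
```

This sketch only matches whole whitespace-separated tokens, so a short form with punctuation attached (such as "brb!") would be left as it is.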

Expanding Contractions


Contractions are words or combinations of words shortened by dropping a few letters and replacing them with an apostrophe.

Examples of contractions:

  • “don’t” is “do not” 
  • “should’ve” is “should have” 

NLP models don’t know about these contractions; they will treat “don’t” and “do not” as two different words.

We should apply this technique only if our problem statement requires it. Otherwise, leave the text as it is.

Implementation of expanding contractions

In the code below, we import the CONTRACTION_MAP dictionary from the contraction file and then define the expand_contractions function to expand any contractions in our input text.

We can observe in the output that the contraction “doesn’t” in the example text is expanded to “does not”.

In the expand_contractions function, we match the contractions in our text against the keys of the contraction map. If we have not lower cased the text before this step, we have to keep the first character of the matched word so that a contraction like “Doesn’t” expands to “Does not”.

Otherwise, we can skip those steps in the script.
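The matching and first-character handling described above can be sketched like this. The three-entry CONTRACTION_MAP is a hypothetical excerpt defined inline (the full dictionary would be imported from the contraction file), and the sketch only handles the straight apostrophe ('), not the curly one.

```python
import re

# Hypothetical excerpt; the full map would be imported from a separate file.
CONTRACTION_MAP = {
    "don't": "do not",
    "doesn't": "does not",
    "should've": "should have",
}


def expand_contractions(text, contraction_map=CONTRACTION_MAP):
    """Expand known contractions while keeping the original first letter."""
    pattern = re.compile(
        "|".join(re.escape(key) for key in contraction_map),
        flags=re.IGNORECASE,
    )

    def expand_match(match):
        contraction = match.group(0)
        expanded = contraction_map[contraction.lower()]
        # Keep the first character as typed so "Doesn't" becomes "Does not".
        return contraction[0] + expanded[1:]

    return pattern.sub(expand_match, text)


example = "Doesn't it work? I don't know"
contraction_result = expand_contractions(example)
print(contraction_result)
```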

Stemming

NLP Technique Stemming


Stemming reduces words to their base or root form by removing a few suffix characters. Stemming is a text normalization technique.

There are many stemming algorithms available, but the most widely used one is the Porter stemmer.

For example, stemming books gives book, and stemming learning gives learn.

Stemming words example


But stemming doesn’t always produce the correct form of a word, because it only follows rules such as removing suffix characters to get the base word.

Sometimes the stemmed words don’t relate to the original ones, and sometimes they are non-dictionary or improper words.

We can observe this in the above table in the stemming results for “caring” and “console/consoling”. Because of such results, stemming is not suitable for all NLP tasks.

Implementation of Stemming using PorterStemming from nltk library

In the Python script below, we define the porter_stemmer function to implement the stemming technique, and we call the function with example text.

Before calling the function, we have to create an object of the PorterStemmer class so that we can use its stem method.

In the porter_stemmer function, we tokenize the input using word_tokenize from the nltk library. Then we apply the stem method to each token and rebuild the text from the stemmed words.
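To make the suffix-stripping idea concrete without any dependency, here is a toy sketch in plain Python. It is deliberately NOT the Porter algorithm and not the nltk API: the suffix list, the minimum stem length, and the names naive_stem and stem_text are assumptions chosen only to show how rule-based trimming can produce both sensible and non-dictionary results.

```python
def naive_stem(word):
    """Strip one common suffix. A toy rule set, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_text(text):
    """Apply the toy stemmer to every whitespace-separated token."""
    return " ".join(naive_stem(word) for word in text.split())


example = "books learning languages caring"
stem_result = stem_text(example)
print(stem_result)
```

With these toy rules, books and learning reduce to book and learn, while languages and caring reduce to languag and car, which shows how pure suffix trimming can leave non-dictionary or unrelated words.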

Lemmatization

Lemmatization words example

The aim of lemmatization is similar to that of stemming: to reduce inflected words to their original or base form. But the lemmatization process is different from the above approach.

Lemmatization does not just trim suffix characters; instead, it uses lexical knowledge bases to find the base words. The result of lemmatization is a meaningful dictionary word, unlike stemming.

Because of the disadvantages of stemming, people often prefer lemmatization to get the base or root form of words. This preprocessing technique is also optional; we have to apply it based on our problem statement.

Suppose we are working on a POS (parts-of-speech) tagging problem. There, the original words carry more information about the data, so we may want to leave them as they are. Compared to stemming, lemmatization is a little slower.

Let’s see the implementation of lemmatization using the nltk library.

Implementation of lemmatization using nltk

In the script below, before calling the lemmatization function, we have to create a WordNetLemmatizer object to use it.

We can see the differences between the outputs of stemming and lemmatization. For programmers, program, and programming, the results are all different, and for languages, lemmatization gives a meaningful word while the stemmed form is meaningless.

Differences between Stemming and Lemmatization

Stemming

  • Rule-based text normalization technique.
  • Stemming removes the suffix of a word to get a base word.
  • Stemming does not always produce meaningful or dictionary words.
  • Stemming is fast.

Lemmatization

  • Lemmatization is also a text normalization technique with the same aim as stemming.
  • Lemmatization uses lexical knowledge to get the root word of the original one.
  • The resulting words of lemmatization are meaningful dictionary words.
  • Compared to stemming, lemmatization is slow.

No emojis please

In today’s online communication, emojis play a crucial role.

Emojis are small images that users add to express their feelings, and they are understood by people all over the world. For some problem statements, we need to remove emojis from the text.

Let’s see how we can remove emojis for that type of problem statement.

Implementation of emoji removing

For this we take code snippets from this GitHub Repo.
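A minimal sketch of emoji removal with a regex character class is shown below. The Unicode ranges are an assumption covering the most common emoji blocks, not an exhaustive list, so rarer emojis may slip through; the name remove_emojis is our own.

```python
import re

# Assumed ranges for common emoji blocks; not an exhaustive list.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticon faces
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]+"
)


def remove_emojis(text):
    """Delete every character that falls inside the emoji ranges above."""
    return EMOJI_PATTERN.sub("", text)


example = "i ordered fried rice that is 😋😋"
emoji_result = remove_emojis(example)
print(emoji_result)
```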

Remove Emojis

Removal of Emoticons

Emoticons Removal


Emojis and emoticons are different things. An emoticon portrays a human facial expression using just keyboard characters, such as letters, numbers, and punctuation marks.

The same rule applies as for emojis: if the problem statement doesn’t require emoticons, we can remove them.

Implementation of removing of emoticons

To remove emoticons from the text, we need a list of emoticons; in this GitHub Repo, we can find all emoticons as a dictionary.

We take the EMOTICONS dictionary from that GitHub repo and save it on our system as emoticons_list.py. After that, we import that file into our preprocessing code.
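A minimal sketch of the removal step follows. To keep it self-contained, a small hypothetical EMOTICONS excerpt is defined inline instead of being imported from emoticons_list.py; the descriptions in it are sample values.

```python
import re

# Hypothetical excerpt; the full dictionary would come from emoticons_list.py.
EMOTICONS = {
    ":-)": "Happy face smiley",
    ":)": "Happy face smiley",
    ":-(": "Frown, sad",
    ":(": "Frown, sad",
    ":D": "Laughing",
}


def remove_emoticons(text):
    """Delete every emoticon listed in the EMOTICONS dictionary."""
    # Try longer emoticons first and escape them, since characters such as
    # "(" and ")" have a special meaning inside regular expressions.
    keys = sorted(EMOTICONS, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub("", text)


example = "Great food :-) but slow service :("
emoticon_result = remove_emoticons(example)
print(emoticon_result)
```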
Emoticons Removal example


Converting Emojis to words


In the previous section, we removed emojis from the text, but some problem statements get information from emojis.

In that case, we shouldn’t remove emojis.

For example, suppose we are working on sentiment analysis of restaurant reviews. One review is

“i ordered fried rice that is 😋😋”

and another review is

“i ordered fried rice that is 😞😠”.

If we remove the emojis from these two sentences, we cannot get the user’s sentiment. So, in this case, we can convert the emojis into words.

Implementation of converting emoji to words using python

From this GitHub Repo, we can also get emoji words and the Unicode of the corresponding emojis in a dictionary.

Take the EMO_UNICODE dictionary from that repo and save it in a Python file; then we can import the EMO_UNICODE dictionary into our code.

EMO_UNICODE has the emoji word as the key and the Unicode character as the value. But for converting emojis to words, we need that dictionary reversed, with the Unicode character as the key and the emoji word as the value.
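The reversal and replacement can be sketched as follows. The three-entry EMO_UNICODE dictionary is a hypothetical excerpt in the shape described above, and the names UNICODE_EMO and emoji_to_words are our own.

```python
# Hypothetical excerpt in the described shape: emoji word -> Unicode character.
EMO_UNICODE = {
    ":face_savoring_food:": "\U0001F60B",
    ":disappointed_face:": "\U0001F61E",
    ":angry_face:": "\U0001F620",
}

# Reverse the mapping: Unicode character -> emoji word.
UNICODE_EMO = {emoji: name for name, emoji in EMO_UNICODE.items()}


def emoji_to_words(text):
    """Replace each known emoji with its word, e.g. "angry_face"."""
    for emoji, name in UNICODE_EMO.items():
        # Pad with spaces so adjacent emojis become separate words.
        text = text.replace(emoji, " " + name.strip(":") + " ")
    # Collapse the padding spaces that the replacement introduced.
    return " ".join(text.split())


example = "i ordered fried rice that is 😞😠"
emoji_words_result = emoji_to_words(example)
print(emoji_words_result)
```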

Emojis To Words Example

Converting Emoticons to words

The purpose of converting emoticons to words is the same as that of converting emojis to words. The only difference is that here we convert emoticons instead of emojis.

Emoticons to words example


Implementation of converting emoticons to words

Take the EMOTICONS dictionary from this GitHub Repo.  We saved that dictionary of emoticons in an emoticons_list python file.

In the below code, we import the EMOTICONS dictionary from that file.
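A minimal sketch of the conversion is given below. As an assumption for illustration, a small hypothetical EMOTICONS excerpt is defined inline rather than imported, and each description is turned into a single underscore-joined token.

```python
import re

# Hypothetical excerpt; the full dictionary would come from emoticons_list.py.
EMOTICONS = {
    ":-)": "Happy face smiley",
    ":)": "Happy face smiley",
    ":-(": "Frown, sad",
    ":(": "Frown, sad",
}


def emoticons_to_words(text):
    """Replace each known emoticon with a word built from its description."""
    for emoticon in sorted(EMOTICONS, key=len, reverse=True):
        # "Frown, sad" -> "frown_sad": keep only the words, join with "_".
        words = "_".join(re.findall(r"\w+", EMOTICONS[emoticon])).lower()
        text = text.replace(emoticon, words)
    return text


example = "Loved it :-) but the wait :("
emoticon_words_result = emoticons_to_words(example)
print(emoticon_words_result)
```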

Removing of Punctuations or Special Characters


Punctuation marks or special characters are all characters except digits and letters. The full list of special characters is [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~].

It is better to remove or convert emoticons before removing punctuation or special characters.

If we apply this technique before the emoticon-related techniques, we may lose the emoticons in the text, because emoticons are made of punctuation characters. So if we use an emoticon technique, apply it before removing punctuation.

We also need to be careful with the period. For example, if we remove the period from text like “money 20.98”, we lose the period (.) between 20 and 98, and the number completely loses its meaning.

So we have to choose carefully which punctuation marks to remove.

Removing of Punctuations or Special Characters example


Implementation of removing punctuations using string library
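A minimal sketch using string.punctuation and str.translate is shown below. The keep parameter is our own addition for illustration: it lets us protect selected characters, such as the period in a decimal number.

```python
import string


def remove_punctuation(text, keep=""):
    """Delete punctuation characters, except those listed in keep."""
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    # str.maketrans with a third argument maps those characters to None.
    return text.translate(str.maketrans("", "", to_remove))


example = "Hello, world! Money: 20.98"
print(remove_punctuation(example))
print(remove_punctuation(example, keep="."))
```

The first call also removes the period inside 20.98, while the second call keeps it.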

Removing of Stopwords

Stopwords are common words from which we can’t get much useful information for our model or problem statement.

A few stopwords are “a”, “an”, “the”, etc.

For example, we can ignore stop words when we work on sentiment analysis or text classification problems. But in the case of POS (Parts-Of-Speech) tagging or language translation, stop words carry useful information, so we have to consider keeping them.

Stopwords Example


We can import lists of stop words from different NLP libraries such as nltk, spacy, gensim, etc.

Let’s see how to remove stopwords from the text by using the stop words from all three libraries.

Implementation of removing stopwords using all stop words from nltk, spacy, gensim

In the code mentioned above, we take stopwords from different libraries such as nltk, spacy, and gensim, and then keep the unique stop words from all three lists. In the remove_stopwords function, we check whether each tokenized word is in the stop words or not; if it is not, we append it to the list of words without stopwords.
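The same logic can be sketched without any library. The two small sets below are hypothetical stand-ins for the lists that would come from nltk, spacy, or gensim; the set union plays the role of keeping only the unique stop words.

```python
# Hypothetical stand-ins for stop word lists loaded from NLP libraries.
STOP_WORDS_A = {"a", "an", "the", "is", "of"}
STOP_WORDS_B = {"the", "is", "are", "in", "and", "to"}

# The union keeps each stop word once, even if several lists contain it.
ALL_STOP_WORDS = STOP_WORDS_A | STOP_WORDS_B


def remove_stopwords(text, stop_words=ALL_STOP_WORDS):
    """Keep only the tokens that are not in the stop word set."""
    text_without_stopwords = []
    for word in text.split():
        # Compare in lower case so "The" is also recognized as a stop word.
        if word.lower() not in stop_words:
            text_without_stopwords.append(word)
    return " ".join(text_without_stopwords)


example = "The movie is a masterpiece of the decade"
stopwords_result = remove_stopwords(example)
print(stopwords_result)
```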

Removing of Frequent words

In the above section, we removed stopwords.

Stopwords are words that are common across the whole language, whereas frequent words are words that are common in a particular domain.

If we are working on a problem statement for a specific field, we can ignore the common words of that domain, because those frequent words don’t give much information.

Implementation of frequent words removing

Here we use the Counter class from the collections library to find and remove the frequent words of our corpus.

In the above script, we defined two functions: one counts the frequent words, and the other removes them from our corpus.
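The two functions can be sketched as follows. The function names, the top_n parameter, and the tiny medical corpus are assumptions for illustration.

```python
from collections import Counter


def frequent_words(sentences, top_n=3):
    """Return the set of the top_n most frequent words in the corpus."""
    counter = Counter()
    for sentence in sentences:
        counter.update(sentence.split())
    return {word for word, _ in counter.most_common(top_n)}


def remove_frequent_words(sentences, top_n=3):
    """Drop the most frequent words from every sentence."""
    frequent = frequent_words(sentences, top_n)
    return [
        " ".join(word for word in sentence.split() if word not in frequent)
        for sentence in sentences
    ]


corpus = ["patient has fever", "patient has cough", "patient reports headache"]
print(frequent_words(corpus, top_n=2))
print(remove_frequent_words(corpus, top_n=2))
```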

Removing of Rare words

Removing rare words is similar to removing frequent words. Here we remove the words that occur least often in the corpus.

Implementation of rare words removing

In the script below, as in the one above, we define two functions: one finds the rare words and the other removes them. We take only ten rare words for this sample text; this number may increase based on our text corpus.
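A minimal sketch of the two functions is shown below. The names and the sample corpus are assumptions; n defaults to ten as in the description above, and the example call uses a smaller value because the sample corpus is tiny.

```python
from collections import Counter


def rare_words(sentences, n=10):
    """Return the set of the n least frequent words in the corpus."""
    counter = Counter(word for s in sentences for word in s.split())
    # most_common() sorts by descending count, so the rarest words
    # sit at the tail of the list.
    return {word for word, _ in counter.most_common()[-n:]}


def remove_rare_words(sentences, n=10):
    """Drop the n least frequent words from every sentence."""
    rare = rare_words(sentences, n)
    return [
        " ".join(word for word in s.split() if word not in rare)
        for s in sentences
    ]


corpus = ["good food good service", "good food slow service", "good waiter"]
print(rare_words(corpus, n=2))
print(remove_rare_words(corpus, n=2))
```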

Removing single characters

After performing all the text preprocessing techniques except removing extra spaces, it is better to remove any single characters left in our corpus. We can remove them using regex.

Implementation of removing single characters
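A minimal sketch, assuming a single character means one isolated ASCII letter. The pattern and the function name are our own; note that this also removes real one-letter words such as "a" and "I".

```python
import re


def remove_single_characters(text):
    """Replace every isolated single letter with a space."""
    # \b marks a word boundary, so only letters standing alone are matched.
    single_char_pattern = r"\b[a-zA-Z]\b"
    return re.sub(single_char_pattern, " ", text)


example = "this is x a test b"
single_char_result = remove_single_characters(example)
print(single_char_result)
```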

Removing Extra Whitespaces

This is the last preprocessing technique. We cannot get any information from extra whitespace, so we can remove all of it: one or more newlines, tabs, and extra spaces.

Our suggestion is to apply this preprocessing technique last, after performing all the other text preprocessing techniques.

Implementation  of removing extra whitespaces
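A minimal sketch with the re module: the \s+ pattern matches any run of spaces, tabs, and newlines, and strip removes what is left at both ends. The function name is our own.

```python
import re


def remove_extra_whitespace(text):
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


example = "  too   many\n\nspaces\t here "
whitespace_result = remove_extra_whitespace(example)
print(whitespace_result)
```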

Process of applying all text preprocessing techniques with an Example 

For this process, we provide the complete Python code in our dataaspirant github repo. You have to download the preprocessing.py file, extracting the downloaded archive if needed.

Then import the text preprocessing class from the preprocessing file into your code. Now we will discuss how to use it.

Implementation of Complete preprocessing techniques 

Text Preprocessing Techniques flow


Below, we apply only a few text preprocessing techniques to show how to use the imported class.

Here we are taking the Sms_spam_or_not dataset.

From the dataset, we take the text column and convert it into a list. We create an object of the preprocess class, which is imported from the preprocessing file.

To apply preprocessing techniques, we pass a list of sentences and a list of techniques to the preprocessing function of that object.

We listed all the techniques with their short forms in the comment section of the code. Pass the short forms of the techniques you want as the technique list.
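To illustrate the calling pattern without the downloaded file, here is a self-contained sketch of the same idea. Everything in it is hypothetical: the short forms ("lc", "rht", "rp", "rew"), the TECHNIQUES table, and the preprocess function are our own stand-ins and may differ from the names used in preprocessing.py.

```python
import re
import string


def lower_case(text):
    return text.lower()


def remove_html_tags(text):
    return re.sub(r"<[^>]+>", "", text)


def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_extra_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical short forms; the codes in the real file may differ.
TECHNIQUES = {
    "lc": lower_case,
    "rht": remove_html_tags,
    "rp": remove_punctuation,
    "rew": remove_extra_whitespace,
}


def preprocess(sentences, techniques):
    """Apply the named techniques to every sentence, in the given order."""
    processed = []
    for sentence in sentences:
        for name in techniques:
            sentence = TECHNIQUES[name](sentence)
        processed.append(sentence)
    return processed


sentences = ["<b>FREE</b> entry!!  Call   NOW"]
print(preprocess(sentences, ["rht", "lc", "rp", "rew"]))
print(preprocess(sentences, ["rp", "rht", "lc", "rew"]))
```

The two calls differ only in order. In the second one, punctuation removal runs before HTML tag removal, so the angle brackets are stripped first and the letters of the tags stay in the text, which shows why the order of techniques matters.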

Conclusion

In this article, we explained most of the text preprocessing techniques. We do not need to perform all of them for every problem. Just download the file and import it into your code.

Call the function with a list of sentences and a list of text preprocessing techniques. Select the techniques and their order carefully, because the result of preprocessing depends on the order in which the techniques are applied.

Recommended Natural Language Processing Courses

  • NLP Specialization with Python
  • NLP Classification and vector spaces
  • NLP Model Building With Python


