Roadmap to Natural Language Processing (NLP)


By Pier Paolo Ippolito, The University of Southampton




Due to the rise of Big Data over the past decade, organizations are now faced with analysing large amounts of data from a wide variety of sources on a daily basis.

Natural Language Processing (NLP) is the area of research in Artificial Intelligence focused on processing and using text and speech data to create smart machines and derive insights.

One of the most interesting NLP applications today is creating machines able to discuss complex topics with humans. IBM's Project Debater represents, so far, one of the most successful approaches in this area.

Video 1: IBM Project Debater


Preprocessing Techniques

Some of the most common techniques applied in order to prepare text data for inference are:

  • Tokenization: is used to segment the input text into its constituent words (tokens). In this way, it becomes much easier to then convert our data into a numerical format.
  • Stop Words Removal: is applied in order to remove from our text all the common function words such as articles and prepositions (e.g. “an”, “the”, etc…), which can be considered a source of noise in our data (since they don’t carry additional informative content).
  • Stemming: is finally applied in order to get rid of all the affixes in our data (e.g. prefixes or suffixes). In this way, it becomes much easier for our algorithm not to treat as distinct words that actually have a similar meaning (e.g. insight ~ insightful).

All of these preprocessing techniques can be easily applied to different types of text using standard Python NLP libraries such as NLTK and spaCy.
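The three steps above can be sketched in a few lines of pure Python. This is a deliberately naive illustration (the stop-word list and suffix rules are made up for the example); in practice NLTK's `word_tokenize`, `stopwords` corpus and `PorterStemmer` handle these steps far more robustly.

```python
import re

# Illustrative stop-word list; NLTK ships a much larger one.
STOP_WORDS = {"a", "an", "the", "is", "of", "to"}

def tokenize(text):
    # Split the input text into lowercase word tokens.
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    # Naive suffix stripping; a real stemmer applies ordered rewrite rules.
    for suffix in ("ful", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = [stem(t) for t in remove_stop_words(tokenize("The insightful book is amazing"))]
```

Running this maps "insightful" and "insight" to the same stem, which is exactly what we want before converting the text to numbers.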

Additionally, in order to extract the syntax and structure of our text, we can make use of techniques such as Parts of Speech (POS) Tagging and Shallow Parsing (Figure 1). Using these techniques, in fact, we explicitly tag each word with its lexical category (which is based on the word's syntactic context).


Figure 1: Parts of Speech Tagging Example [1].
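The tagging idea of Figure 1 can be reduced to a toy sketch: map each word to a lexical category. The hand-made lexicon below is purely illustrative; a real tagger such as `nltk.pos_tag` uses a trained, context-aware model rather than a lookup table.

```python
# Toy lexicon of word -> Penn Treebank-style tag (illustrative only).
LEXICON = {"the": "DT", "brown": "JJ", "fox": "NN",
           "jumped": "VBD", "over": "IN", "dog": "NN"}

def pos_tag(tokens):
    # Default unknown words to NN (noun), a common fallback heuristic.
    return [(tok, LEXICON.get(tok.lower(), "NN")) for tok in tokens]

tags = pos_tag("The brown fox jumped over the dog".split())
```

A real tagger would resolve ambiguous words (e.g. "jumped" as verb vs. "jump" as noun) from the surrounding context, which a plain dictionary cannot do.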


Modelling Techniques


Bag of Words

Bag of Words is a technique used in Natural Language Processing and Computer Vision in order to create new features for training classifiers (Figure 2). This technique is implemented by constructing a histogram counting all the words in our document (taking no account of word order or syntax rules).


Figure 2: Bag of Words [2]
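The histogram construction is a one-liner with the standard library (the example sentence is illustrative; scikit-learn's `CountVectorizer` does the same over a whole corpus):

```python
from collections import Counter

def bag_of_words(document):
    # Count each word, ignoring order and syntax entirely.
    return Counter(document.lower().split())

bow = bag_of_words("the cat sat on the mat")
```

Note how "the" dominates the histogram despite carrying little information, which is precisely the weakness discussed next.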


One of the main problems that can limit the efficacy of this technique is the presence of prepositions, pronouns, articles, etc… in our text. In fact, these can all be considered words likely to appear frequently in our text without necessarily being informative about the main characteristics and topics of our document.

In order to remedy this kind of problem, a technique called “Term Frequency-Inverse Document Frequency” (TF-IDF) is commonly used. TF-IDF aims to rescale the word count frequencies in our text by considering how frequently each of the words in our text appears overall in a large sample of texts. Using this technique, we then reward words (scaling up their frequency value) which appear quite commonly in our text but rarely in other texts, while penalising words (scaling down their frequency value) which appear frequently both in our text and in other texts (such as prepositions, pronouns, etc…).
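A compact sketch of this rescaling, under the common definitions tf = raw count and idf = log(N / document frequency), with an illustrative three-document corpus (scikit-learn's `TfidfVectorizer` implements a smoothed variant of the same idea):

```python
import math

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]

def tfidf(word, document, corpus):
    tf = document.count(word)                          # raw count in this document
    df = sum(1 for doc in corpus if word in doc)       # documents containing the word
    return tf * math.log(len(corpus) / df)

score_the = tfidf("the", corpus[0], corpus)  # appears everywhere -> 0.0
score_cat = tfidf("cat", corpus[0], corpus)  # rarer -> positive score
```

"the" appears in every document, so its idf is log(3/3) = 0 and its weight vanishes, while the rarer "cat" keeps a positive weight.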


Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is a type of Topic Modelling technique. Topic Modelling is a field of research focused on finding ways to cluster documents in order to discover latent distinguishing markers that can characterise them based on their content (Figure 3). Therefore, Topic Modelling can also be considered in this context as a dimensionality reduction technique, since it allows us to reduce our initial data to a limited set of clusters.


Figure 3: Topic Modelling [3]


Latent Dirichlet Allocation (LDA) is an unsupervised learning technique used to discover latent topics that can characterise different documents and to cluster similar documents together. This algorithm takes as input the number N of topics believed to exist and then groups the different documents into N clusters of documents which are closely related to each other.

What distinguishes LDA from other clustering techniques such as K-Means Clustering is that LDA is a soft-clustering technique (each document is assigned to a cluster based on a probability distribution). For example, a document might be assigned to a Cluster A because the algorithm determines that it is 80% likely that this document belongs to this class, while still taking into account that some characteristics embedded in this document (the remaining 20%) are more likely to belong instead to a second Cluster B.
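A minimal sketch of this workflow using scikit-learn (assumed available); the four-document corpus and the choice of N = 2 topics are illustrative. The key point is that each row of the output is a probability distribution over topics, i.e. a soft assignment:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats make great pets",
    "stock markets fell sharply today",
    "investors traded shares on the market",
]

# Build word-count features, then fit LDA with N = 2 topics.
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # soft assignments: each row sums to 1
```

Inspecting `doc_topics` shows each document split probabilistically across the two topics, rather than being forced into exactly one cluster as K-Means would do.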


Word Embeddings

Word Embeddings are one of the most common ways to encode words as vectors of numbers which can then be fed into our Machine Learning models for inference. Word Embeddings aim to reliably map our words into a vector space so that similar words are represented by similar vectors.


Figure 4: Word Embedding [4]


Nowadays, there are three main techniques used in order to create Word Embeddings: Word2Vec, GloVe and fastText. All three of these techniques use a shallow neural network in order to create the desired word embedding.

In case you are interested in finding out more about how Word Embeddings work, this article is a great place to start.
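The "similar words, similar vectors" property can be shown with a tiny hand-made embedding table (the three-dimensional vectors below are invented for illustration; real embeddings are learned and have hundreds of dimensions):

```python
import numpy as np

# Made-up 3-dimensional embeddings, for illustration only.
emb = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.7, 0.7, 0.1]),
    "apple": np.array([0.1, 0.0, 0.9]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 for identical directions, ~0 for unrelated ones.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

With these vectors, `cosine(emb["king"], emb["queen"])` is far higher than `cosine(emb["king"], emb["apple"])`, mirroring how a trained embedding places semantically related words close together.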


Sentiment Analysis

Sentiment Analysis is an NLP technique commonly used in order to understand whether some form of text expresses a positive, negative or neutral sentiment about a topic. This can be particularly useful when, for example, trying to find out the general public opinion (through online reviews, tweets, etc…) about a topic, product or company.

In sentiment analysis, the sentiment of a text is usually represented as a value between -1 (negative sentiment) and 1 (positive sentiment) called polarity.

Sentiment Analysis can be considered an Unsupervised Learning technique, since we are not usually provided with handcrafted labels for our data. In order to overcome this obstacle, we make use of prelabelled lexicons (dictionaries of words) which have been created to quantify the sentiment of a large number of words in different contexts. Some examples of widely used lexicons in sentiment analysis are TextBlob and VADER.
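The lexicon idea can be sketched in a few lines. The word scores below are invented for the example (real lexicons such as VADER's are much larger and also handle negation, intensifiers and punctuation); the polarity is just the average score of the words found in the lexicon:

```python
# Toy sentiment lexicon: word -> polarity in [-1, 1] (illustrative values).
LEXICON = {"great": 0.8, "good": 0.5, "bad": -0.6, "terrible": -0.9}

def polarity(text):
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    # Average the scores of known words; 0.0 (neutral) if none match.
    return sum(scores) / len(scores) if scores else 0.0
```

For instance, `polarity("a great movie")` is positive while `polarity("a terrible film")` is negative, and text with no lexicon words comes out neutral.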



Transformers

Transformers represent the current state-of-the-art NLP models for analysing text data. Some examples of widely known Transformer models are BERT and GPT2.

Before the creation of Transformers, Recurrent Neural Networks (RNNs) represented the most efficient way to analyse text data sequentially for prediction, but this approach found it quite difficult to reliably make use of long-term dependencies (e.g. our network might find it difficult to understand whether a word fed in several iterations ago could turn out to be useful for the current iteration).

Transformers successfully managed to overcome this limitation thanks to a mechanism called Attention (which is used in order to determine which parts of the text to focus on and give more weight to). Additionally, Transformers made it easier to process text data in parallel rather than sequentially (therefore improving execution speed).
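The core of the Attention mechanism can be sketched as scaled dot-product attention: each query scores every key, and the resulting softmax weights decide which value vectors to focus on (the random matrices below are placeholders for learned projections of the input text):

```python
import numpy as np

def attention(Q, K, V):
    # Similarity of each query to each key, scaled by sqrt of key dimension.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Numerically stable softmax over keys: each row becomes a weighting.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 query positions
K = rng.normal(size=(5, 4))   # 5 key positions
V = rng.normal(size=(5, 4))   # 5 value vectors
out, weights = attention(Q, K, V)
```

Because every query attends to every key at once, the whole computation is a pair of matrix multiplications, which is what makes Transformers so amenable to parallel execution.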

Transformers can nowadays be easily implemented in Python thanks to the Hugging Face library.


Text Prediction Demonstration

Text prediction is one of the tasks that can be easily implemented using Transformers such as GPT2. In this example, we give as input a quote from “The Shadow of the Wind” by Carlos Ruiz Zafón, and our transformer will then generate another 50 characters which should logically follow our input data.

A book is a mirror that offers us only what we already carry inside us. It is a way of understanding ourselves, and it takes a whole lifetime of self awareness as we become aware of ourselves. This is a real lesson from the book My Life.

As can be seen from the example output shown above, our GPT2 model performed quite well in creating a reasonable continuation for our input string.

An example notebook which you can run in order to generate your own text is available at this link.

I hope you enjoyed this article, thank you for reading!



If you want to keep updated with my latest articles and projects, follow me on Medium and subscribe to my mailing list. These are some of my contact details:



[1] Extract Custom Keywords using NLTK POS tagger in python, Thinkinfi, Anindya Naskar. Accessed at:
[2] Comparison of bag-of-words model (BoW) and set-of-words model (SoW), ProgrammerSought. Accessed at:
[3] Topic Modeling: Art of Storytelling in NLP, TechnovativeThinker. Accessed at:
[4] Word Mover’s Embedding: Universal Text Embedding from Word2Vec, IBM Research Blog. Accessed at:

Bio: Pier Paolo Ippolito is a final-year MSc Artificial Intelligence student at The University of Southampton. He is an AI enthusiast, Data Scientist and RPA Developer.

Original. Reposted with permission.


