Natural Language Generation (Practical Guide) | by German Sharabok | Jan, 2021



Natural Language Generation (NLG) is a kind of AI that is capable of generating human language from structured data. It is closely related to Natural Language Processing (NLP), but the two are distinct.

To put it simply, NLP lets the computer read, and NLG lets it write. This is a fast-growing field, which allows computers to understand the way we communicate. Currently, it is used for writing suggestions, such as email autocompletion, and even for producing human-readable texts without any human intervention.

In this article, you will learn in practice how this process works, and how you can generate artificial news headlines with just a few lines of code, ones that can pass for the real thing.

We will use Markov Chains for this task, so first let’s understand what they are, and why we want to use them.

A Markov chain is a system that transitions between states through a random, memoryless process: the next state depends only on the current state, not on the history of how we got there. You have probably studied them in a math course in the past but maybe did not see a practical way of applying them.

Let’s take a look at an example. Alex is a baby, and he only does three things: eat, sleep, and cry. We call these states, and the set of all states Alex can be in is called the state space.

So a Markov chain shows us all the possible transitions between these states and the probability of each one happening.

Picture a simple diagram of these three states connected by arrows labeled with probabilities. From sleeping, Alex can transition to crying, eating, or back to sleeping, each with a certain probability.

By observing Alex over time, we are able to estimate these probabilities based on his normal behavior. This is the basic idea behind Markov chains, but why are we using them for NLG?
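To make this concrete, here is a minimal Python sketch of Alex's chain. The probabilities here are invented purely for illustration; the point is that the next state is sampled using only the current one:

import random

# Transition probabilities for Alex's three states
# (made-up numbers, purely for illustration).
transitions = {
    "sleep": {"sleep": 0.5, "eat": 0.3, "cry": 0.2},
    "eat":   {"sleep": 0.4, "eat": 0.2, "cry": 0.4},
    "cry":   {"sleep": 0.3, "eat": 0.5, "cry": 0.2},
}

def simulate(start, steps):
    """Walk the chain for a given number of steps."""
    state = start
    history = [state]
    for _ in range(steps):
        options = transitions[state]
        state = random.choices(list(options), weights=list(options.values()))[0]
        history.append(state)
    return history

print(simulate("sleep", 10))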

The answer is fairly simple. The way we structure sentences can be easily modeled by Markov chains. If each word is a state, then a sentence is just a sequence of transitions between states, which lets us construct a chain similar to the one above.

By looking at many examples, we can see how people normally form sentences and in which order words tend to come. After seeing enough sentences, we can actually develop a model that generates sentences similar to the ones made by humans.

Since all the model needs is a set of states and transition probabilities, it is cheap for a computer to store and sample from. The computer chooses where to transition based on those probabilities, which lets us generate news headlines in our example. This technique can be applied to any type of text, so I encourage you to use the model you build to produce other texts, for example, tweets of some celebrity.
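Before reaching for a library, it helps to see how little code the core idea takes. Below is a toy first-order, word-level generator; the sample sentences are invented just to show the mechanism that Markovify automates and refines:

from collections import defaultdict
import random

# A toy corpus (invented here just to show the idea).
sentences = [
    "the dog chased the cat",
    "the cat chased the mouse",
    "the mouse ate the cheese",
]

# Count which word follows which (a first-order, word-level chain).
followers = defaultdict(list)
for sentence in sentences:
    words = sentence.split()
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word].append(next_word)

# Generate by repeatedly sampling a follower of the current word.
word = "the"
output = [word]
for _ in range(6):
    if word not in followers:
        break
    word = random.choice(followers[word])
    output.append(word)
print(" ".join(output))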

For this little project, we will make use of Markovify, a Markov chain generator conveniently available for Python. It is often used to build Markov models of large corpora of text and generate random sentences from them. However, its applications are not limited to that, so feel free to check out the real-world examples on the Markovify GitHub page.

We will also use the “A Million News Headlines” dataset from Kaggle to train our model. It includes a large number of headlines from articles published by ABC News.

You do not need much programming experience to complete the project, so do not worry about that. To run Python, simply go to Kaggle, create your own notebook under the Notebooks section, and follow all the further steps there.

Loading the packages

We will use pandas to structure and explore the headline data, Markovify to build the Markov chain models, and random to sample example headlines.

import pandas as pd  # data loading and manipulation
import markovify     # Markov chain text models
import random        # sampling example headlines

Reading the input file

If you are using Kaggle Notebooks, simply use the line below to load the dataset. If you want to run the code on your computer, download the dataset and pass its path to read_csv instead. (We name the DataFrame data rather than input, since input shadows a Python built-in.)

data = pd.read_csv('../input/abcnews-date-text.csv')

data.head(10)

Sample Headlines

The following code will output 10 random headlines from the first 100, so we can take a look at what the data looks like.

data.headline_text.iloc[random.sample(range(100), 10)]
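As an aside, pandas can do the same sampling in a single call, which you may find cleaner:

data.headline_text.head(100).sample(10)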

Building a model

Here we use the headline data to build our Markovify model. The state_size argument sets the number of words the model looks at when generating new sentences.

For example, a state_size of 2 means the model checks the previous 2 words to decide which word can come next. The larger this value, the more context the model uses. However, if it is too large, the model will simply not be able to generate many sentences, as the dataset rarely contains enough matching word sequences.

model = markovify.NewlineText(data.headline_text, state_size=2)

Generate Headlines

Now for the fun part: the following code will generate 10 headlines for us. The output is different every time, so you can run it again and again.

for i in range(10):
    print(model.make_sentence())
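Note that make_sentence() can return None even at this state size, when Markovify fails to assemble a sentence that passes its originality checks. It accepts a tries argument that raises the number of attempts before giving up, so a safer version of the loop looks like this:

for i in range(10):
    sentence = model.make_sentence(tries=100)
    if sentence is not None:
        print(sentence)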

Different State Sizes

Here we try larger state sizes, which make the headlines more coherent and closer to real ones.

model1 = markovify.NewlineText(data.headline_text, state_size=3)
model2 = markovify.NewlineText(data.headline_text, state_size=4)

State Size 3:

for i in range(5):
    print(model1.make_sentence())

State Size 4 (Note: with a state size of 4, the model must match four consecutive words, and many headlines are too short or too unusual for that, so make_sentence() often returns None. That is why we check the result before printing):

for i in range(10):
    temp = model2.make_sentence()
    if temp is not None:
        print(temp)

Ensemble Models

We can also combine two or more models and weight their contributions to see whether the results improve. Note that markovify.combine requires the models to share the same state size, and the weights scale how strongly each model's transition counts influence the result. Combining identical models trained on the same data changes nothing, so to make the example meaningful we train each model on a different half of the dataset.

# Train two models on different halves of the data, then combine them,
# giving the first model 1.5 times the weight of the second.
half = len(data) // 2
model11 = markovify.NewlineText(data.headline_text[:half], state_size=2)
model12 = markovify.NewlineText(data.headline_text[half:], state_size=2)
model_combo = markovify.combine([model11, model12], [1.5, 1])

for i in range(5):
    print(model_combo.make_sentence())

You have now learned a real application of Markov chains and how to get started with NLG. The process is fairly simple, and the code above can be adapted to almost any other text-generation task.
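As a closing sketch, suppose you have a plain-text file of some other corpus, say collected tweets or a book (the filename below is hypothetical). markovify.Text splits sentences on punctuation rather than newlines, and make_short_sentence() caps the output length:

# Hypothetical file of plain text (e.g., a book or collected tweets).
with open('my_corpus.txt') as f:
    text = f.read()

# markovify.Text splits sentences on punctuation instead of newlines.
text_model = markovify.Text(text, state_size=2)

for i in range(5):
    sentence = text_model.make_short_sentence(140, tries=100)
    if sentence is not None:
        print(sentence)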

[1] Kulkarni, R. (n.d.). A Million News Headlines. Retrieved January 18, 2021, from https://www.kaggle.com/therohk/million-headlines

[2] Singer-Vine, J. (n.d.). Markovify. Retrieved January 18, 2021, from https://github.com/jsvine/markovify
