Ten Fast Text Preprocessing Benchmarks on CPU, GPU, and TPU | by Bruce H. Cottman, Ph.D. | Jan, 2021



We show Python code and benchmarks for ten different spaCy text preprocessing actions.

Figure 1. Racing computing platforms. Source: Photo by Pietro Mattia on Unsplash

Introduction

Estimates state that 70%–85% of the world’s data is text (unstructured data). Additionally, new deep learning language models (transformers) have caused explosive growth in industrial applications.

This is not an introductory article on Natural Language Processing (NLP). Feeding a sequence of tokens, created from the raw text, into different Natural Language models is not covered here. Instead, we focus on preprocessing the text before it is input as tokens into a Natural Language model.

Raw text degrades NLP modeling unless a noise removal operation deletes or transforms noisy words before the text becomes a sequence of tokens. Noise removal is usually NLP-model dependent. For example, email addresses may need to be removed for a text classification task but kept for a text redaction task.

Normalization of the corpus (the sequence of tokens) transforms the text into a standard form. The most frequent example is transforming all characters to lowercase, as in the short example below. Nevertheless, be careful: some advanced NLP models make use of capitalization or uppercase information.
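As a minimal illustration (plain Python, not the spaCy benchmark code), lowercase normalization is just:

text = "The Client Was Difficult. :)"
normalized = text.lower()   # lowercase normalization; capitalization information is lost
print(normalized)

=>

the client was difficult. :)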

In production-grade (NLP), fast text preprocessing (noise cleaning and normalization) is critical to model production deployment.

We benchmark token noise cleaning and normalization preprocessing using spaCy on CPU, TPU, and three GPUs.

Colab Benchmark Platform

Google Cloud Platform (GCP) Colab is a customized Jupyter notebook image offered as a cloud service in the GCP framework. The short definition is that “Colab is a Jupyter notebook running in GCP.” Follow the URLs if any questions arise about Jupyter or Colab.

One way to get to Colab:

  1. Create a Google account;
  2. Go to https://colab.research.google.com/ and log in with your Google account;
  3. Create a Google Drive; Colab can access any files in your shared drive.

Note: You can access GitHub files via a GitHub URL or search by organization or user.

The Colab CPU configuration for the benchmark is an Intel CPU @ 2.2 GHz with 3 CPUs, 6 cores, and 58 MB of CPU cache.

!cat /proc/cpuinfo

=>

Figure 2: Colab instance CPU resources configured.

Colab is free and can provision an Nvidia GPU or Google TPU for you.

Figure 3: Colab “Change runtime type” panel.
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

=>

...
physical_device_desc: "device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0"]

In this case, an Nvidia Tesla P100 GPU with 16 GB of memory is provisioned. Depending on what is available, Colab may provision anything from a T4 up to a high-end Nvidia V100 GPU.
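You can also confirm the provisioned GPU directly from a notebook cell with Nvidia’s nvidia-smi utility (an aside, not part of the benchmark code):

!nvidia-smi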

We used Intel CPU, Google TPU, Nvidia T4, P100, and V100 for the benchmarks.

Creating the spaCy pipeline and Doc

spaCy is the fastest package we know for Natural Language Processing (NLP) operations. spaCy is an NLP library implemented in both Python and Cython. Because of Cython, parts of spaCy are faster than if implemented in pure Python. spaCy is available for MS Windows, macOS, and Ubuntu, and runs natively on Nvidia GPUs.
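If you want to follow along, spaCy, the spacymoji emoji extension used later, and the large English model can be installed in a Colab cell (exact versions may differ from those used in the benchmarks):

!pip install spacy spacymoji
!python -m spacy download en_core_web_lg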

To preprocess text with spaCy, we transform the text into a sequence of tokens. The spaCy pipeline acts on this sequence of tokens, resulting in a corpus Doc object.

Our text preprocessing end goal is to produce tokens that feed into our NLP models.

Figure 4. spaCy pipeline. Source: https://github.com/explosion/spaCy/blob/master/website/docs/images/pipeline.svg
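As a minimal sketch (assuming the en_core_web_lg model downloaded above), creating a Doc and inspecting its tokens looks like this:

import spacy

nlp = spacy.load("en_core_web_lg")   # build the pipeline
doc = nlp("Prices rose to $4 :(")    # run the pipeline; doc is a sequence of Token objects
print([token.text for token in doc])

=>

['Prices', 'rose', 'to', '$', '4', ':(']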

You configure spaCy to use a GPU by:

import spacy
spacy.prefer_gpu() # or spacy.require_gpu()

=>

True

Adding emoji attributes to the spaCy pipeline

You can either create your own pipeline component class from scratch, or even replace the default tokenizer with an entirely custom function. You customize the pipeline by adding custom components to it. We cover spaCy pipeline customization in an upcoming blog article.

# Emoji comes from the spacymoji pipeline extension (pip install spacymoji)
from spacymoji import Emoji
import en_core_web_lg

nlp = en_core_web_lg.load()
nlp.max_length = len(long_s) + 10              # long_s is the practice text created below
do = nlp.disable_pipes(["tagger", "parser"])   # disable pipes we do not benchmark
emoji = Emoji(nlp)
nlp.add_pipe(emoji, first=True)
nlp.pipe_names

=>

['emoji', 'ner']

Note: The tokenizer is a “special” component and isn’t part of the regular pipeline. It also doesn’t show up in nlp.pipe_names. The reason is that there can only be one tokenizer, and while all other pipeline components take a Doc and return it, the tokenizer takes a string of text and turns it into a Doc.
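A quick way to see this (a side illustration, not benchmark code):

print(nlp.pipe_names)                      # the tokenizer is not listed
doc = nlp.tokenizer("Colab is free :)")    # the tokenizer maps a string to a Doc
print(type(doc), len(doc))

=>

['emoji', 'ner']
<class 'spacy.tokens.doc.Doc'> 4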

Creating long_s Practice Text and Resulting spaCy Corpus

We create long_s, a long string with extra whitespace, emoji, email addresses, $ symbols, HTML tags, punctuation, and other text that may or may not be noise for the downstream NLP model.

# build a noisy practice string and repeat it MULTIPLIER times
MULTIPLIER = int(3.8e3)
text_l = 300
long_s = ':( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 '
long_s += ' 888 eihtg DoD Fee https://medium.com/ #hash ## <html><title>Document Title</title></html> '
long_s += ':( cat- \n nip'
long_s += ' immed- \n natedly <html><h2>2nd levelheading</h2></html> . , '
long_s += "# bhc@gmail.com f@z.yx can't Be a ckunk. $4 $123,456 won't seven "
long_s += ' $Shine $$beighty?$ '
long_s *= MULTIPLIER
%time long_s_doc = nlp(long_s)
print('size: {:g} {}'.format(len(long_s_doc), long_s_doc[:text_l]))

=>

Wall time: <processor dependent> size: 307800

:( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee https://medium.com/ #hash ## <html><title>Document Title</title></html> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # bhc@gmail.com f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$ :( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee https://medium.com/ #hash ## <html><title>Document Title</title></html> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # bhc@gmail.com f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$ :( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee https://medium.com/ #hash ## <html><title>Document Title</title></html> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # bhc@gmail.com f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$ :( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee https://medium.com/ #hash ## <html><title>Document Title</title></html> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # bhc@gmail.com f@z.yx can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$ :( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508

Note: size is the number of tokens in the spaCy Doc long_s_doc, not the number of characters in long_s.
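This hedged two-liner (not part of the benchmarks) makes the distinction explicit:

print('characters:', len(long_s))      # length of the raw string
print('tokens:    ', len(long_s_doc))  # number of tokens in the spaCy Doc (307,800 above)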

Each token operation is described below. The time to compute each operation is given in Figure 5.

Figure 5. Benchmark times for spaCy actions by processor.

Removing emoji

You can remove emoji using the emoji pipeline add-on’s token._.is_emoji attribute:

%time long_s_doc_no_emojicon = [token  for token in long_s_doc if token._.is_emoji == False]
print('size: {:g} {}'.format(len(long_s_doc_no_emojicon),long_s_doc_no_emojicon[:int(text_l/5)]))

Replace emojis with a phrase

We can translate emoji and emoticons into a natural language phrase.
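The translation code below relies on an EMOJI_TO_PHRASE lookup table that is not defined in this article; a hypothetical sketch of such a mapping (illustrative entries only) might be:

# Hypothetical emoji-to-phrase mapping; the article's full table is not shown.
EMOJI_TO_PHRASE = {
    '😻': 'smiling cat face with heart-eyes',
    '😈': 'smiling face with horns',
    ':(': 'frowning face',
}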

%time text = [token.text if (token.text in EMOJI_TO_PHRASE) == False else EMOJI_TO_PHRASE[token.text] for token in long_s_doc]
%time long_s = ' '.join(text)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

Removing e-mail address

We remove e-mail tokens using the spaCy token attribute like_email:

%time tokens = [token for token in long_s_doc if not token.like_email]
print('size: {:g} {}'.format(len(tokens),tokens[:int(text_l/3)]))

EMOJI Sentiment Score

The EMOJI sentiment score is not a text preprocessor in the classic sense. However, we find that emoji almost always carry the dominant sentiment in a document. For example, consider two similar phrases from legal-note emails with opposite sentiment.

The client was challenging. :(
The client was difficult. :)

We calculate sentiment only from the emoji and emoticons present in a note or e-mail.
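The code below relies on an EMOJI_TO_SENTIMENT_VALUE lookup table that is not defined in this article; a hypothetical sketch (values chosen for illustration only) could look like:

# Hypothetical emoji-to-sentiment mapping in the range [-1.0, 1.0]; illustrative values only.
EMOJI_TO_SENTIMENT_VALUE = {
    '😻': 0.9,
    '😈': -0.3,
    ':(': -0.6,
    ':)': 0.6,
}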

%time scl = [EMOJI_TO_SENTIMENT_VALUE[token.text] for token in long_s_doc if (token.text in EMOJI_TO_SENTIMENT_VALUE)]
len(scl), sum(scl), sum(scl)/len(scl)

The sentiment was 0.07 (neutral) for a 0.5-million-character “note” with 15,200 emoji and emoticons, computed in 171 to 195 ms on the different processors. A fast sentiment analysis calculation!

Remove whitespace and punctuation

We remove whitespace and punctuation simultaneously using spaCy tokens.

%time tokens = [token.text for token in long_s_doc if (token.pos_ not in ['SPACE','PUNCT'])]
%time text = ' '.join(tokens)
print('size: {:g} {}'.format(len(text),text[:text_l]))

Replacing Currency Symbol

We replace the currency symbol in tokens using the spaCy token attribute is_currency:

%time token = [token.text if token.is_currency == False else '_CUR_' for token in long_s_doc]
%time long_s = ' '.join(token)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

Note: spaCy flags emoji and emoticons as punctuation, so removing punctuation removes them as well. You can protect the emoticons with:

%time long_s_doc = [token  for token in long_s_doc if token.is_punct == False or token._.is_emoji == True]
print('size: {:g} {}'.format(len(long_s_doc),long_s_doc[:50]))

However, string replace and regex ignore context and replace any currency symbol. You may have multiple uses of $ in your text and thus cannot ignore context. In this case, you can use spaCy:

%time tk = [token.text if token.is_currency == False else '_CUR_' for token in long_s_doc]
%time long_s = ' '.join(tk)
print('size: {:g} {}'.format(len(long_s),long_s[:250]))

Removing stop-words

New NLP models (e.g., logistic regression and transformers) and NLP tasks (e.g., sentiment analysis) continue to be added. Some benefit from stopword removal, and some do not. — Industrial-Strength Natural Language Processing: Turbo-charge your spaCy NLP pipeline

Note: We use different deep learning language models (transformers) and do not remove stopwords.

%time tokens = [token.text for token in long_s_doc if token.is_stop == False]
%time long_s = ' '.join(tokens)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

Lemmatization

Lemmatization looks beyond word reduction and considers a language’s full vocabulary to apply a morphological analysis to words.

Lemmatization looks at the surrounding text to determine a given word’s part of speech. It does not categorize phrases.

%time tokens = [token.lemma_ for token in long_s_doc]
%time long_s = ' '.join(tokens)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

Note: spaCy does not have stemming. You can add it if you want. Stemming does not work as well as lemmatization because stemming does not consider context. (This is why some researchers consider spaCy “opinionated.”)
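If you do want stemming, one option (an illustrative sketch, not part of the benchmarks) is to stem spaCy tokens with NLTK’s PorterStemmer:

from nltk.stem import PorterStemmer   # pip install nltk

stemmer = PorterStemmer()
%time stems = [stemmer.stem(token.text) for token in long_s_doc]
%time long_s = ' '.join(stems)
print('size: {:g} {}'.format(len(long_s), long_s[:text_l]))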
