Learn a New Language in a WordCloud | by David Bruce | Nov, 2020
That said, I am a novice in NLP. I wanted to get my feet wet by creating my very first wordcloud, to further my own linguistic diversity, and to do my part to bolster the #BenderRule by bucking the notion that English is the default “natural language.” My example below is a minuscule step toward my personal goal of one day reading Don Quijote in Spanish (fingers crossed). It hardly fights for the kind of equity needed in NLP today, since Spanish is itself a relatively well-documented language among the 7,000 we are aware of. Still, I hope that at the very least this article has opened your eyes to the disparity within NLP and furthers the Bender Rule: at a bare minimum, name the languages you’re working with, even English.
So today I am going to work with English and Spanish, using some code I learned on YouTube from Kostadin Ristovski, to create two simple wordclouds that will help me read Don Quijote in its original Spanish one day. I started by downloading .txt files of both the original and the English translation of the book from Project Gutenberg, which compiles and publishes free e-books that are in the public domain in the US. Let’s begin by importing the proper packages:
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
We’ve included numpy and matplotlib to help with the visualization. Of course, if wordcloud is not already installed, you can add it from the command line:
pip install wordcloud
For our most basic version we can read our .txt file, instantiate a WordCloud object, and then visualize it. WordCloud takes a number of arguments you can play with, including the background color, the number of words, and even the font size. I left those at their defaults for this version. I gave the wordcloud the name wc_eng to indicate that this is my English wordcloud.
# Read the English translation and create a WordCloud with default settings
text_eng = open('don_quijote_eng.txt', 'r').read()
wc_eng = WordCloud()
One more thing before we get to making our wordcloud in Spanish: stopwords. Stopwords are the basic words in any language that don’t significantly alter the meaning of a sentence. In English, for example, stopwords are words like “the,” “a,” “and,” and “for.” Because they are so common, they are collected into stopword lists that you can keep out of your wordcloud. You can also manually add specific words you would like to keep out of the wordcloud. You’ll see that we already imported STOPWORDS from wordcloud, so let’s see what we can make. Let’s read in the Spanish text now and plot the two side by side:
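First, a short aside on what a stopword list actually does. The sketch below uses only the standard library and two tiny, hand-picked lists (ENGLISH_STOPWORDS and SPANISH_STOPWORDS are assumptions made for illustration, not the lists any library ships). The top_words helper is hypothetical too. It is worth keeping in mind that the STOPWORDS set in wordcloud is an English list, so it may not filter much out of a Spanish text.

```python
import re
from collections import Counter

# Tiny illustrative lists (assumptions); real stopword lists are far longer.
ENGLISH_STOPWORDS = {"the", "a", "and", "for", "of", "in", "to"}
SPANISH_STOPWORDS = {"el", "la", "de", "que", "y", "en", "un", "a"}


def top_words(text, stopwords, n=3):
    """Count words in text, skipping stopwords, and return the n most common."""
    words = re.findall(r"\w+", text.lower())  # \w also matches accented letters
    kept = [w for w in words if w not in stopwords]
    return Counter(kept).most_common(n)


if __name__ == "__main__":
    # Opening line of Don Quijote
    line = "En un lugar de la Mancha, de cuyo nombre no quiero acordarme"
    # With the English list nothing is filtered, so "de" comes out on top
    print(top_words(line, ENGLISH_STOPWORDS, n=1))
    # With the Spanish list, content words such as "lugar" and "mancha" remain
    print(top_words(line, SPANISH_STOPWORDS, n=3))
```

The same idea carries over to the stopwords argument of WordCloud, which accepts a set of words to leave out. With that in mind, here is the setup for the two clouds: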
text_esp = open('don_quijote_esp.txt', 'r').read()

fig = plt.figure(figsize=(15,6))
ax = fig.subplots(1,2)

stopwords = set(STOPWORDS)

wc_eng = WordCloud(background_color='white', stopwords=stopwords)
wc_eng.generate(text_eng)

wc_esp = WordCloud(background_color='white', stopwords=stopwords)