The Year 2020: Analyzing Twitter Users’ Reflections using NLP | by Jessica Uwoghiren | Dec, 2020
A Sentiment Analysis Project using Python and Tableau
A lot happened this year, and if you watch the movie “Death to 2020” on Netflix, you will get a good sense of the timeline of events. For this project, I thought it would be interesting to gain insights into what Twitter users had to say about 2020. Twitter receives more than 500 million tweets a day, so all I had to do was find a way to retrieve and analyze a sample of them. Before embarking on this project, however, I had no idea what strategy to use, so I found myself browsing through dozens of articles on various concepts in Natural Language Processing (NLP), and the more I read, the more interested I became in the topic. The tweets used for this project were created between 12 and 25 December 2020, so if you tweeted about 2020 during this period, there is a good chance your tweet is part of this analysis. By the end of this article, you will learn:
- The most common words Twitter users used to describe the year 2020
- The time of day when Twitter users are most active (by country and continent), via the interactive Tableau dashboard
- The proportion of positive, negative, and neutral tweets
- The country with the most tweets
- The most retweeted and most liked tweet within the period
- How long this project took, and how to implement a similar project yourself
Let’s dive into my analysis; I can’t wait to show you what I discovered.
This project had several facets, outlined in the flow diagram below. I will explain the basics here and cover some concepts in more depth in subsequent Medium posts.
The Python libraries used include Pandas (for data cleaning/manipulation), Tweepy (for tweet mining), NLTK (Natural Language Toolkit, for text analysis), TextBlob (for sentiment analysis), Matplotlib & WordCloud (for word cloud visualization), Emot (for emoji identification), Plotly (for data visualization) and other built-in libraries, as shown in my Jupyter Notebook.
This was probably the most arduous part of the project because, unlike my previous projects, where I had existing datasets, I had to build this one from scratch. To do that, I used the Tweepy library for Python to scrape tweets.
Thanks to this Towards Data Science article by Tara Boyle, I found my way around the Twitter API (Application Programming Interface). However, some things had changed because Twitter now enforces stricter limits on mining tweets through its API. One of these limits is that you can only retrieve a maximum of 2,500 tweets every 15 minutes. Since I wanted to work with a large dataset, I used the “wait_on_rate_limit” parameter in Tweepy, which makes the code pause automatically whenever the rate limit is reached. Also, tweets can only be mined as far back as about 10 days. After three consecutive days of running the program I wrote, I had scraped 50,780 unique tweets. The highlights of this step are listed below; you can see detailed explanations in my Jupyter Notebook.
Highlights of Tweet Mining Task
- Search Query: I passed four phrases – [“2020 has been”, “2020 was a”, “this year has been”, “this year was a”] – to the API so that it would return tweets containing them. Twitter requires a specific syntax to recognize that you want an “exact phrase” match. I also mined only tweets written in English for this analysis, so the views of Twitter users in non-English-speaking countries may be underrepresented.
- Information Returned: I specified that the API returns the following data for each tweet – Tweet ID (primary key), Tweet, Time Created, Location, Number of Retweets and Likes. I did not retrieve Twitter usernames for ethical reasons.
- Successive Mining Attempts: On the second and third day of running my code, I had to specify the “since_id” parameter so that the Twitter API would not return tweets already collected on the previous day(s).
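The mining step above can be sketched with Tweepy. This is a rough sketch, not the author’s exact program: the credentials are placeholders, the function and field names (`exact_phrase_query`, `mine_tweets`, the dictionary keys) are my own, and the `api.search` call assumes Tweepy 3.x (the version current in December 2020; in Tweepy 4.x the method is `api.search_tweets`).

```python
def exact_phrase_query(phrases):
    """Build a Twitter search query of quoted exact-phrase matches, OR-ed together."""
    return " OR ".join(f'"{p}"' for p in phrases)

PHRASES = ["2020 has been", "2020 was a", "this year has been", "this year was a"]

def mine_tweets(query=None, since_id=None, max_tweets=1000):
    """Stream matching English tweets; pass since_id on later runs to skip tweets already collected."""
    import tweepy  # imported here so the query helper above runs even without Tweepy installed

    # Placeholder credentials -- substitute your own Twitter developer keys
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    # wait_on_rate_limit makes Tweepy sleep automatically whenever the rate limit is hit
    api = tweepy.API(auth, wait_on_rate_limit=True)

    cursor = tweepy.Cursor(
        api.search,  # in Tweepy 4.x this method is api.search_tweets
        q=query or exact_phrase_query(PHRASES),
        lang="en",
        since_id=since_id,
        tweet_mode="extended",
        count=100,
    )
    for tweet in cursor.items(max_tweets):
        yield {
            "tweet_id": tweet.id,          # primary key
            "tweet": tweet.full_text,
            "created_at": tweet.created_at,
            "location": tweet.user.location,
            "retweets": tweet.retweet_count,
            "likes": tweet.favorite_count,
        }
```

For example, `exact_phrase_query(PHRASES)` produces `"2020 has been" OR "2020 was a" OR …`, which is the quoted syntax Twitter needs for exact-phrase matching.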
Cleaning up your data is vital because it helps prevent errors and duplication in your analysis. In this step, I looked for duplicate tweets using the primary key (Tweet ID), checked for empty rows, and replaced “NaN” (null) values in the “Location” column with the string “No Location” (I explain why in the Location Geocoding section).
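In Pandas, those cleaning steps might look like the sketch below. The column names (`tweet_id`, `tweet`, `location`) are assumptions for illustration, not necessarily the ones in the author’s notebook.

```python
import pandas as pd

def clean_tweets_df(df):
    """Drop duplicate tweets by primary key, drop empty rows, and flag missing locations."""
    df = df.drop_duplicates(subset="tweet_id")             # Tweet ID is the primary key
    df = df.dropna(subset=["tweet"])                       # remove rows with no tweet text
    df["location"] = df["location"].fillna("No Location")  # replace NaN locations
    return df.reset_index(drop=True)
```

For example, a frame with a repeated Tweet ID and a missing location comes back with the duplicate dropped and the location set to “No Location”.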
To achieve the ultimate goal, i.e. Sentiment Analysis, there was a need to clean up the individual tweets. To facilitate this task, I created a function called “preProcessTweets” in my Python program, which I later applied to the “Tweets” column to produce the desired results. This function removes punctuation, links, emojis, and stop words from the tweets in a single run. Additionally, I used a concept known as “Tokenization” in NLP: a method of splitting a sentence into smaller units called “tokens”, so that each unit can be processed and filtered individually. Another noteworthy technique is “Lemmatization”, the process of returning words to their “base” (dictionary) form. A simple illustration is shown below.
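A simplified sketch of such a pre-processing function is below. It is not the author’s “preProcessTweets” implementation: it uses regular expressions, a tiny inline stop-word list, and plain whitespace tokenization instead of NLTK’s stop-word corpus, tokenizers, and WordNet lemmatizer, so that it runs with no downloads.

```python
import re
import string

# Tiny illustrative stop-word subset; the full NLTK list has ~180 English entries
STOP_WORDS = {"the", "a", "an", "is", "has", "been", "this", "was", "i", "it", "to", "of"}

def pre_process_tweet(tweet):
    """Lower-case, strip links/mentions/emojis/punctuation, then tokenize and drop stop words."""
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+|www\.\S+", "", tweet)   # remove links
    tweet = re.sub(r"@\w+", "", tweet)                    # remove @mentions
    tweet = tweet.encode("ascii", "ignore").decode()      # crude emoji/non-ASCII removal
    tweet = tweet.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = tweet.split()                                # simple whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]
```

Running it on `"2020 has been a tough year! https://t.co/xyz"` yields `["2020", "tough", "year"]`. Lemmatization (e.g. NLTK’s `WordNetLemmatizer`, which maps “years” to “year”) would be applied to each token after this step.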
In this section, I will show you the most common words used by Twitter users to describe 2020. I created a “getAdjectives” function to extract only the adjectives from each tweet into a new column, since adjectives are descriptive words. This was made possible by the POS-tag (Part-of-Speech tagging) module of the NLTK library. Using the WordCloud library, you can generate a word cloud based on word frequency and superimpose the words on any image (in this case, the Twitter logo); I used the Pyplot module of the Matplotlib library to display the image. The word cloud shows higher-frequency words in a larger text size, while less common words appear smaller.
You can view my Jupyter Notebook for the code to achieve the Word Cloud above. As you can see, the words “good”, “hard”, “bad”, “great”, “tough”, “last”, “difficult” were some of the most common words used. The frequency of the top ten words is displayed in the plot below.
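The counting behind such a frequency plot can be sketched with `collections.Counter`. This sketch skips the adjective filtering (which the author did with NLTK’s POS tagging) and assumes the tweets have already been reduced to token lists; the sample data and `top_words` name are mine.

```python
from collections import Counter

def top_words(token_lists, n=10):
    """Count word frequency across all tweets' token lists and return the n most common."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(n)

# Hypothetical pre-processed tweets (lists of tokens)
tweets_tokens = [["good", "hard", "year"], ["good", "tough"], ["bad", "good"]]
# The most frequent entry here is ("good", 3)
```

The resulting `(word, count)` pairs are exactly what a bar plot of the top ten words, or WordCloud’s `generate_from_frequencies`, consumes.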
For my final dashboard, I wanted to add a map showing the number of tweets per country. To do this, Tableau needs basic geographic information such as the country’s name. I had used the Geopy library for a previous project, but this time around I could not use it due to server limit errors. After further research, I ended up using the HERE Developer API to return longitude, latitude, and country name for each tweet location. One key thing to note: if you send a request with a “NaN” (null) location to the API, it still returns an actual (spurious) location instead of an error. That is why I replaced the “NaN” values with “No Location” in the Data Cleaning step. It is very important to review the data frame after each code run to ensure you are getting the expected results; that is how I caught this discrepancy.
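A sketch of that geocoding step is below, using only the standard library. The endpoint and response field names (`items`, `position`, `address.countryName`) are based on HERE’s v1 Geocoding & Search API and should be checked against the current documentation; the helper names are mine. The key point is the guard that skips rows flagged “No Location”, so the API is never given a null location to geocode.

```python
import json
import urllib.parse
import urllib.request

GEOCODE_URL = "https://geocode.search.hereapi.com/v1/geocode"  # HERE geocoding endpoint

def should_geocode(location):
    """Skip rows flagged 'No Location' during cleaning; the API would otherwise return a spurious match."""
    return bool(location) and location != "No Location"

def geocode(location, api_key):
    """Return (lat, lng, country) for a free-text location, or None if skipped or not found."""
    if not should_geocode(location):
        return None
    params = urllib.parse.urlencode({"q": location, "apiKey": api_key})
    with urllib.request.urlopen(f"{GEOCODE_URL}?{params}") as resp:
        items = json.load(resp).get("items", [])
    if not items:
        return None
    pos = items[0]["position"]                    # {"lat": ..., "lng": ...}
    country = items[0]["address"]["countryName"]  # country name for Tableau's map
    return pos["lat"], pos["lng"], country
```

Applying `geocode` only where `should_geocode` is true keeps the country map honest: rows without a location stay out of the map instead of being assigned a fake one.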
I will write about using the Developer Here API in the future because of the challenges I encountered while attempting Geocoding with other APIs.
Now, to the core of this project – Sentiment Analysis. From the words we use in our statements, one can tell whether they are positive, negative, or neutral. However, what if we could train a computer or model to do this automatically?
The above illustration is Sentiment Analysis in a nutshell. Thanks to the sentiment analysis algorithms contained in libraries such as TextBlob and VADER, we can analyze text and return a sentiment score. These lexicon-based approaches need no labelled training data. You could also train a machine learning model to predict these sentiments, but you would need a dataset of tweets with accurate sentiment labels to do so.
I must point out here that these algorithms have their error margins because the context for the trained model is different from the context for these tweets. For this analysis, I went with TextBlob. TextBlob analyzes sentences by giving each sentence a Subjectivity and Polarity score.
Based on the Polarity scores, one can define the tweets’ sentiment category. A Polarity score of < 0 is Negative, 0 is Neutral, while > 0 is Positive. I used the Pandas “apply” method on the “Polarity” column in my data frame to return the respective Sentiment category. The distribution of the Sentiment categories is shown below. You can also view the Sentiment Category distribution by country and continent in the Tableau dashboard HERE.
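That mapping from polarity to category can be sketched as below. The `polarity_to_sentiment` and `add_polarity` names are my own, not the author’s; `TextBlob(text).sentiment.polarity` is the actual TextBlob call, imported locally so the rest of the sketch runs without the package.

```python
import pandas as pd

def polarity_to_sentiment(polarity):
    """Map a polarity score to a sentiment category: < 0 Negative, 0 Neutral, > 0 Positive."""
    if polarity < 0:
        return "Negative"
    if polarity > 0:
        return "Positive"
    return "Neutral"

def add_polarity(df):
    """Score each tweet with TextBlob (requires the textblob package)."""
    from textblob import TextBlob
    df["polarity"] = df["tweet"].apply(lambda t: TextBlob(t).sentiment.polarity)
    return df

# Hypothetical polarity scores -> sentiment categories via Pandas "apply"
df = pd.DataFrame({"polarity": [-0.5, 0.0, 0.8]})
df["sentiment"] = df["polarity"].apply(polarity_to_sentiment)
# -> Negative, Neutral, Positive
```

The `apply` call on the “Polarity” column is the same pattern described above for producing the sentiment category column.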
I was so excited to build this dashboard because I had never used Tableau before starting this project. To develop the final dashboard you see in the animation below, I exported the results from my Jupyter Notebook (where I ran my Python program) to Tableau. The Tableau dashboard has six unique elements; explore it by clicking this LINK. The dashboard can be viewed on any device, but for a fuller view, use a computer or tablet.
Fun fact: I learned how to use Tableau in two hours using this Tableau Community Tutorial on December 23rd. This was after I discovered that I could not publicly share my dashboard created using Microsoft PowerBI. Also, the entire project took me two weeks since I had to combine it with my full-time job.