Where should I eat after the pandemic? (Part 2/2) | by Matthew Brulhardt | Jan, 2021
In the last article, I trained a model on the ABSA task from the SemEval-2014 dataset and analyzed its performance, speed, and behaviors. This article details how I use this model to choose a restaurant to dine at from the Yelp dataset. Without further delay, let’s get started!
We can download the Yelp dataset in one of two ways: directly from Yelp, or through Kaggle.
The first requires signing an agreement with Yelp, after which the dataset can be downloaded as a zip file. The Kaggle method requires an account with a username and an API key set up locally at ~/.kaggle/kaggle.json. The following link can help you get set up with Kaggle:
If you use Yelp, you have to download the zip file and then upload it to Google Colab. If you use Kaggle, however, you can skip the extra step and stream the data into Google Colab directly. I’ll use the Kaggle method for this tutorial. The following code downloads the dataset from Kaggle:
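As a rough sketch of the Kaggle route, the snippet below checks that the API key file is in place and builds the Kaggle CLI call. The dataset slug `yelp-dataset/yelp-dataset`, the `data` destination folder, and both helper names are assumptions for illustration, so check the dataset page on Kaggle before relying on them.

```python
from pathlib import Path

# Assumed dataset slug on Kaggle; verify it on the dataset page.
YELP_DATASET_SLUG = "yelp-dataset/yelp-dataset"


def has_kaggle_credentials(path=Path.home() / ".kaggle" / "kaggle.json"):
    """Return True if the Kaggle API key file exists at the given location."""
    return Path(path).is_file()


def kaggle_download_command(slug=YELP_DATASET_SLUG, dest="data"):
    """Build the Kaggle CLI call that downloads a dataset into dest and unzips it."""
    return ["kaggle", "datasets", "download", "-d", slug, "-p", str(dest), "--unzip"]
```

In Colab, one way to use this is to call `subprocess.run(kaggle_download_command(), check=True)` once `has_kaggle_credentials()` returns `True`.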
To score each review, we need to map each label to a polarity in the range [-1, 1].
Instead of taking the classifier’s hard classification, it is more informative to work with its soft classification, which tells us how positive or negative an aspect of a review might be. We can find an aspect’s expected polarity over a review by combining the probability vector produced by the model with the polarity map we’ve defined above. For example, let’s say we input a review to the model, evaluate it along the aspect of food, and get the following output:
Instead of assigning food a polarity of -1, we find the expected polarity over these labels as follows:
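To make the idea concrete, here is a minimal sketch of the expected polarity, computed as the sum over labels of probability times polarity. The polarity map and the probabilities below are hypothetical stand-ins, not the model's actual output, and any additional label (such as SemEval's "conflict") would need its own entry in the map.

```python
# Hypothetical polarity map: each label gets a value in [-1, 1].
POLARITY = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}


def expected_polarity(probs, polarity=POLARITY):
    """Probability-weighted average polarity over all labels."""
    total = sum(probs.values())  # normalise in case the vector does not sum to 1
    return sum(p * polarity[label] for label, p in probs.items()) / total


# Made-up soft classification for the "food" aspect of one review.
food_probs = {"positive": 0.35, "neutral": 0.20, "negative": 0.45}

hard_label = max(food_probs, key=food_probs.get)  # the hard classification
score = expected_polarity(food_probs)             # the soft score

print(hard_label, round(score, 2))
```

With these made-up numbers the hard classification is negative, which would map to -1, while the expected polarity is only mildly negative at roughly -0.1.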
As we can see, this method is much more practical because we can get a gradient between positive and negative polarities, rendering a more representative sentiment for each category.
There is one important thing to note before we run our algorithm. There are two sources of reviews in the dataset: reviews and tips.
Tips are more compact than reviews; they’re generally only one sentence. This is helpful because our model struggles with larger bodies of text, as we saw in the previous article. Here is a comparison of the cumulative distributions for the number of words found in the text from each type of review:
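A comparison like this can be sketched with an empirical cumulative distribution of word counts. The example strings below are toy data rather than Yelp text, and the helper names are made up for illustration.

```python
def word_count(text):
    """Count whitespace-separated words in a piece of text."""
    return len(text.split())


def empirical_cdf(values):
    """Return sorted (value, fraction of samples <= value) pairs."""
    ordered = sorted(values)
    n = len(ordered)
    cdf = {}
    for rank, value in enumerate(ordered, start=1):
        cdf[value] = rank / n  # repeated values keep the largest fraction
    return sorted(cdf.items())


# Toy examples standing in for the two text sources.
toy_tips = ["Great tacos!", "Try the brunch menu.", "Cash only."]
toy_reviews = [
    "We waited a while for a table but the food made up for it in the end.",
    "The service was friendly and the pasta was cooked perfectly, though pricey.",
]

tip_cdf = empirical_cdf([word_count(t) for t in toy_tips])
review_cdf = empirical_cdf([word_count(r) for r in toy_reviews])
```

Plotting the two lists of pairs on the same axes gives the kind of comparison described above.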
Clearly, tips contain far fewer words than reviews, so they fit well within the bounds of the model. With this in mind, I’m also going to filter for restaurants that are still open and have at least 100 associated tips. This increases the likelihood that there is sufficient information to generate a rating for each aspect. The following code will run batch predictions for all of our filtered tips and write them to a JSON file to be processed in the next step.
Additionally, I’m going to append one more aspect to our set: the restaurant’s overall star rating. Instead of using the stars directly, I’ll build an adjusted star system that accounts for each user’s average stars.
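The filtering step might look something like the sketch below. It assumes business records with `business_id`, `is_open`, and `categories` fields and tip records with `business_id` and `text` fields, and it treats any business whose categories mention "Restaurants" as a restaurant; the helper names and that category rule are illustrative assumptions. The prediction call itself is left out because it depends on the model from the previous article.

```python
from collections import defaultdict

MIN_TIPS = 100  # threshold described in the text


def select_tips(businesses, tips, min_tips=MIN_TIPS):
    """Group tip texts by business, keeping open restaurants with enough tips."""
    open_restaurants = {
        b["business_id"]
        for b in businesses
        # Assumed rule: open businesses whose categories mention "Restaurants".
        if b.get("is_open") == 1 and "Restaurants" in (b.get("categories") or "")
    }
    grouped = defaultdict(list)
    for tip in tips:
        if tip["business_id"] in open_restaurants:
            grouped[tip["business_id"]].append(tip["text"])
    return {bid: texts for bid, texts in grouped.items() if len(texts) >= min_tips}


def batches(items, size):
    """Yield consecutive slices of at most `size` items for batch prediction."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch from `batches(texts, size)` can then be passed to the classifier, and the per-aspect results collected into a dictionary and written out with `json.dump`.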
In assessing the reviews, the bias distribution is skewed left, meaning that people tend to give more stars, on average, than expected in a proper 1–5 star rating system.
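One possible reading of this adjustment is sketched below. It assumes a user's bias is their average star rating minus the scale midpoint of 3, and that the adjusted rating is the raw rating minus that bias, clamped back into the 1 to 5 range. The midpoint, the clamping, and the function names are all assumptions for illustration, not necessarily the formula used here.

```python
SCALE_MIDPOINT = 3.0  # assumed "expected" average on a 1-5 scale


def user_bias(user_average_stars, midpoint=SCALE_MIDPOINT):
    """How far a user's average rating sits above (+) or below (-) the midpoint."""
    return user_average_stars - midpoint


def adjusted_stars(stars, user_average_stars, low=1.0, high=5.0):
    """Shift a rating by the user's bias, then clamp it to the valid star range."""
    raw = stars - user_bias(user_average_stars)
    return min(high, max(low, raw))


# A 5-star rating from a generous user (average 4.5) counts for less
# than the same rating from a user whose average sits at the midpoint.
generous = adjusted_stars(5, 4.5)
neutral = adjusted_stars(5, 3.0)
```

Under these assumptions, a user who averages 4.5 stars has a bias of 1.5, so their 5-star rating is treated as 3.5 stars.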