## Sentiment Analysis using Logistic Regression and Naive Bayes | by Atharva Mashalkar | Nov, 2020

## Let’s compare which algorithm is better for classifying the tweets based on their sentiments.

In supervised machine learning, you usually have an input *X*, which goes into your prediction function to produce a prediction *Ŷ*. You then compare the prediction with the true value *Y*. This gives you the cost, which you use to update the parameters *θ*.

**Sentiment analysis** (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.

So, let’s start sentiment analysis using *Logistic Regression*.

We will be using the sample Twitter data set for this exercise.

Given a tweet, or some text, we can represent it as a vector of dimension *V*, where *V* is the size of our vocabulary. For example, for the tweet “I am learning sentiment analysis”, you would put a 1 at the index of every word that appears in the tweet and a 0 everywhere else. As *V* gets larger, the vector becomes more sparse. Furthermore, we end up with many more features and have to train about *V* parameters in *θ*. This can result in long training and prediction times. Hence, we will instead extract the frequency of every word and build a frequency dictionary.
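To make the sparse representation concrete, here is a minimal sketch of the encoding described above. The eight-word vocabulary and the helper name `one_hot_encode` are assumptions made for illustration; a real vocabulary would contain thousands of words.

```python
# Hypothetical toy vocabulary; a real one would hold thousands of words.
vocabulary = ["am", "analysis", "happy", "i", "learning", "sad", "sentiment", "the"]
word_to_index = {word: index for index, word in enumerate(vocabulary)}


def one_hot_encode(text, word_to_index):
    """Return a list with a 1 at the index of every vocabulary word found in text."""
    vector = [0] * len(word_to_index)
    for word in text.lower().split():
        if word in word_to_index:
            vector[word_to_index[word]] = 1
    return vector


vector = one_hot_encode("I am learning sentiment analysis", word_to_index)
print(vector)  # [1, 1, 0, 1, 1, 0, 1, 0]
```

The vector always has one entry per vocabulary word, so with a realistic vocabulary almost every entry would be 0.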

The idea here is to divide the training set into positive and negative tweets, count all the words, and build a Python dictionary of each word’s frequency in positive and in negative tweets.

For every tweet, we then build a vector of three values: a bias unit, the sum of the positive frequencies (counts in positive tweets) of all the words in the tweet, and the sum of their negative frequencies. We will go into this in more detail below.
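As a small worked example of this three-value vector, the sketch below uses a hand-written frequency dictionary. The counts, the name `toy_freqs` and the helper `toy_features` are hypothetical and only illustrate the arithmetic.

```python
# Hypothetical frequency dictionary: (word, label) -> count, with label 1 = positive.
toy_freqs = {
    ("happi", 1): 12, ("happi", 0): 1,
    ("sad", 1): 2, ("sad", 0): 9,
    ("day", 1): 5, ("day", 0): 4,
}


def toy_features(words, freqs):
    """Build [bias, sum of positive counts, sum of negative counts] for one tweet."""
    positive_sum = sum(freqs.get((word, 1), 0) for word in words)
    negative_sum = sum(freqs.get((word, 0), 0) for word in words)
    return [1, positive_sum, negative_sum]


features = toy_features(["happi", "day"], toy_freqs)
print(features)  # [1, 17, 5]
```

Whatever the vocabulary size, every tweet is reduced to just three numbers, which is what keeps training fast.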

When preprocessing, you have to perform the following:

- Eliminate handles and URLs
- Tokenize the string into words.
- Remove stop words such as “and”, “is”, “a”, “on”, etc.
- Stem every word, i.e. convert it to its stem. For example, “dancer”, “dancing” and “danced” all become “danc”. You can use the Porter stemmer for this.
- Convert all your words to lower case.

To carry out the above steps, follow the code snippets below.

Import the libraries and the sample Twitter data set provided by the nltk (Natural Language Toolkit) package, which contains 5000 positive and 5000 negative tweets. We also import some additional libraries that help with regular expressions and string handling in Python.

    import re
    import string

    import nltk
    import numpy as np
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import TweetTokenizer

    # download the corpora once (skipped if they are already present)
    nltk.download('twitter_samples')
    nltk.download('stopwords')

Here we remove stop words (words that add little value to the model, e.g. ‘the’, ‘is’, ‘are’; without them the model achieves about the same accuracy) and carry out stemming (stripping suffixes in order to reduce the vocabulary size). We also load the English stop word list from the nltk library.

**Note**: We also tokenize the string into a list of words after removing the retweet marker, hash signs and URLs.

    # Preprocessing tweets
    def process_tweet(tweet):
        # remove the old-style retweet marker "RT"
        tweet2 = re.sub(r'^RT[\s]+', '', tweet)
        # remove hyperlinks (the pattern also drops any text after the link on the same line)
        tweet2 = re.sub(r'https?://.*[\r\n]*', '', tweet2)
        # remove only the hash sign (#), keeping the hashtag word itself
        tweet2 = re.sub(r'#', '', tweet2)

        # instantiate the tokenizer: lowercases, strips handles, shortens repeated characters
        tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
        # tokenize the tweet
        tweet_tokens = tokenizer.tokenize(tweet2)

        # get the English stop word list from NLTK
        stopwords_english = stopwords.words('english')

        # keep the tokens that are neither stop words nor punctuation
        tweets_clean = []
        for word in tweet_tokens:
            if word not in stopwords_english and word not in string.punctuation:
                tweets_clean.append(word)

        # instantiate the stemmer
        stemmer = PorterStemmer()

        # build the list of stems of the remaining words
        tweets_stem = []
        for word in tweets_clean:
            stem_word = stemmer.stem(word)
            tweets_stem.append(stem_word)

        return tweets_stem

Now, we will create a function that will take tweets and their labels as input, go through every tweet, preprocess them, count the occurrence of every word in the data set and create a frequency dictionary.

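Note: `np.squeeze` is needed because the labels arrive as an (m, 1) array. Without it, `tolist()` would return a list of one-element lists instead of a flat list of labels.

    # Frequency generating function
    def build_freqs(tweets, ys):
        # flatten the (m, 1) label array into a plain list of labels
        yslist = np.squeeze(ys).tolist()

        # map each (word, label) pair to the number of times it occurs
        freqs = {}
        for y, tweet in zip(yslist, tweets):
            for word in process_tweet(tweet):
                pair = (word, y)
                freqs[pair] = freqs.get(pair, 0) + 1

        return freqs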
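The effect of `np.squeeze` mentioned in the note above can be seen on a small array. This is only an illustrative sketch; the three-row `labels` array is a made-up stand-in for the real label array.

```python
import numpy as np

# Shape (3, 1), the same layout as the label arrays used in this article.
labels = np.array([[1.0], [1.0], [0.0]])

nested = labels.tolist()            # one single-element list per label
flat = np.squeeze(labels).tolist()  # a flat list of labels

print(nested)  # [[1.0], [1.0], [0.0]]
print(flat)    # [1.0, 1.0, 0.0]
```

Only the flat form works as part of a dictionary key: a tuple such as `(word, [1.0])` contains a list and is therefore not hashable.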

The functions required for processing tweets are ready, so now let’s build our logistic regression model.

Logistic regression makes use of the sigmoid function, which outputs a probability between 0 and 1. The sigmoid function with weight parameters $\theta$ and input $x^{(i)}$ is defined as follows:

$$h(x^{(i)}, \theta) = \frac{1}{1 + e^{-\theta^T x^{(i)}}}$$

The sigmoid function gives values between 0 and 1, so we can classify the predictions using a particular cutoff (say, 0.5).

Note that as $\theta^T x^{(i)}$ gets closer and closer to $-\infty$, the denominator of the sigmoid function gets larger and larger and, as a result, the sigmoid gets closer to 0. On the other hand, as $\theta^T x^{(i)}$ gets closer and closer to $\infty$, the denominator gets closer to 1 and, as a result, the sigmoid also gets closer to 1.
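The limiting behaviour described above is easy to check numerically. The sketch below uses only the standard library; the helper name `sigmoid_scalar` is an assumption chosen so that it does not clash with the NumPy version used in the rest of the article.

```python
import math


def sigmoid_scalar(z):
    """Sigmoid of a single number, written with the standard library only."""
    return 1.0 / (1.0 + math.exp(-z))


# Large negative inputs give values near 0, large positive inputs values near 1.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"z = {z:6.1f}  sigmoid = {sigmoid_scalar(z):.5f}")
```

An input of 0 gives exactly 0.5, which is why 0.5 is the natural cutoff between the two classes.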

Now that we have understood the sigmoid function, let’s code it!

Note: The function should work for a scalar as well as an array.

    def sigmoid(z):
        '''
        Input:
            z: the input (can be a scalar or an array)
        Output:
            h: the sigmoid of z
        '''
        # calculate the sigmoid of z
        h = 1 / (1 + np.exp(-z))

        return h

The logistic regression cost function is defined as

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h(x^{(i)}, \theta) + \left(1 - y^{(i)}\right) \log\left(1 - h(x^{(i)}, \theta)\right) \right]$$

We aim to reduce the cost by improving *θ* using the following update:

$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$

Here, *α* is called the learning rate. The process of computing the hypothesis *h* with the sigmoid function and updating the weights *θ* using the derivative of the cost function and a chosen learning rate is called the gradient descent algorithm.

Note: You initialize your parameters *θ*, use them in the sigmoid to get predictions, compute the gradient that you use to update *θ*, and then calculate the cost. You repeat this until the cost is low enough or a fixed number of iterations has been reached.
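For this cost function the partial derivatives work out to $\frac{1}{m}\sum_i \left(h(x^{(i)}, \theta) - y^{(i)}\right) x_j^{(i)}$, or $\frac{1}{m} X^T (h - y)$ in matrix form. The sketch below checks that expression against a finite-difference approximation and applies one update step. The toy data, the learning rate of 0.1 and the helper names are assumptions made for illustration only.

```python
import numpy as np


def cost(theta, x, y):
    """Logistic regression cost J(theta) for features x and labels y."""
    h = 1 / (1 + np.exp(-x @ theta))
    return float(-np.mean(y * np.log(h) + (1 - y) * np.log(1 - h)))


def analytic_gradient(theta, x, y):
    """Gradient of J in matrix form: (1/m) * x^T (h - y)."""
    h = 1 / (1 + np.exp(-x @ theta))
    return x.T @ (h - y) / len(x)


def numeric_gradient(theta, x, y, eps=1e-6):
    """Central finite-difference approximation of the gradient."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, x, y) - cost(theta - step, x, y)) / (2 * eps)
    return grad


# Hypothetical toy data: a bias column plus two features, four examples.
x = np.array([[1.0, 2.0, 0.5], [1.0, 0.3, 1.5], [1.0, 1.2, 1.1], [1.0, 0.1, 0.2]])
y = np.array([[1.0], [0.0], [1.0], [0.0]])
theta = np.array([[0.1], [-0.2], [0.3]])

print(analytic_gradient(theta, x, y).ravel())
print(numeric_gradient(theta, x, y).ravel())

# One gradient descent step with a hypothetical learning rate of 0.1.
new_theta = theta - 0.1 * analytic_gradient(theta, x, y)
print(cost(theta, x, y), cost(new_theta, x, y))
```

The two printed gradients should agree to several decimal places, and the cost after the step should be lower than before it.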

Let’s code what we learned.

    def gradientDescent(x, y, theta, alpha, num_iters):
        '''
        Input:
            x: matrix of features, of shape (m, n+1)
            y: corresponding labels of the input matrix x, of shape (m, 1)
            theta: weight vector of shape (n+1, 1)
            alpha: learning rate
            num_iters: number of iterations you want to train your model for
        Output:
            J: the final cost
            theta: your final weight vector
        Hint: you might want to print the cost to make sure that it is going down.
        '''
        # number of training examples
        m = len(x)

        for i in range(0, num_iters):
            # get z, the dot product of x and theta
            z = np.dot(x, theta)

            # get the sigmoid of z
            h = sigmoid(z)

            # calculate the cost function
            J = (-1/m) * (np.dot(y.T, np.log(h)) + np.dot((1-y).T, np.log(1-h)))

            # update the weights theta
            theta = theta - (alpha/m) * np.dot(x.T, h-y)

        # J is a (1, 1) array; convert it to a plain float
        J = float(np.squeeze(J))
        return J, theta

Now, let’s create a function that extracts features from a tweet using the ‘freqs’ dictionary and the preprocessing function defined above (process_tweet).

    def extract_features(tweet, freqs):
        '''
        Input:
            tweet: a string containing one tweet
            freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        Output:
            x: a feature vector of dimension (1,3)
        '''
        # process_tweet tokenizes, stems, and removes stopwords
        word_l = process_tweet(tweet)

        # 3 elements in the form of a 1 x 3 vector
        x = np.zeros((1, 3))

        # bias term is set to 1
        x[0, 0] = 1

        # loop through each word in the list of words
        for word in word_l:
            # add the count of this word in positive tweets (label 1)
            x[0, 1] += freqs.get((word, 1), 0)

            # add the count of this word in negative tweets (label 0)
            x[0, 2] += freqs.get((word, 0), 0)

        assert x.shape == (1, 3)
        return x

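Now, we will load the data set from nltk, split it into a training set and a test set, and build the frequency dictionary from the training set.

    from nltk.corpus import twitter_samples

    # load the 5000 positive and 5000 negative sample tweets
    all_positive_tweets = twitter_samples.strings('positive_tweets.json')
    all_negative_tweets = twitter_samples.strings('negative_tweets.json')

    # split the data into two pieces, one for training and one for testing (validation set)
    test_pos = all_positive_tweets[4000:]
    train_pos = all_positive_tweets[:4000]
    test_neg = all_negative_tweets[4000:]
    train_neg = all_negative_tweets[:4000]

    train_x = train_pos + train_neg
    test_x = test_pos + test_neg

    # combine positive and negative labels
    train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
    test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

    # build the (word, label) frequency dictionary from the training set
    freqs = build_freqs(train_x, train_y)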

As all the required functions are ready, we can finally train our model using the training data set and test it on the test data set.

    # collect the features 'x' and stack them into a matrix 'X'
    X = np.zeros((len(train_x), 3))
    for i in range(len(train_x)):
        X[i, :] = extract_features(train_x[i], freqs)

    # training labels corresponding to X
    Y = train_y

    # apply gradient descent
    J, theta = gradientDescent(X, Y, np.zeros((3, 1)), 1e-9, 1500)
    print(f"The cost after training is {J:.8f}.")
    print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(theta)]}")

J is the final cost and “theta” holds the final weights after training the model.

Before testing on the test data set, we can check the feature extraction on one training example.

    # check your function
    # test 1: test on training data
    tmp1 = extract_features(train_x[0], freqs)
    print(tmp1)

    # expected output:
    # [[1.00e+00 3.02e+03 6.10e+01]]

Let’s write two more functions. The first, given a tweet, predicts the result using the ‘freqs’ dictionary and theta. The second uses the predict function to compute the accuracy of the model on a given test data set.

    def predict_tweet(tweet, freqs, theta):
        '''
        Input:
            tweet: a string
            freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
            theta: (3,1) vector of weights
        Output:
            y_pred: the probability that the tweet is positive
        '''
        # extract the features of the tweet and store them in x
        x = extract_features(tweet, freqs)

        # make the prediction using x and theta
        z = np.dot(x, theta)
        y_pred = sigmoid(z)

        return y_pred

    def test_logistic_regression(test_x, test_y, freqs, theta):
        """
        Input:
            test_x: a list of tweets
            test_y: (m, 1) vector with the corresponding labels for the list of tweets
            freqs: a dictionary with the frequency of each pair (or tuple)
            theta: weight vector of dimension (3, 1)
        Output:
            accuracy: (# of tweets classified correctly) / (total # of tweets)
        """
        # the list for storing predictions
        y_hat = []

        for tweet in test_x:
            # get the label prediction for the tweet
            y_pred = predict_tweet(tweet, freqs, theta)

            if y_pred > 0.5:
                # predicted class is positive
                y_hat.append(1)
            else:
                # predicted class is negative
                y_hat.append(0)

        # y_hat is a list, but test_y is an (m, 1) array;
        # convert both to one-dimensional arrays in order to compare them using the '==' operator
        y_hat = np.array(y_hat)
        test_y = test_y.reshape(-1)

        accuracy = np.sum((test_y == y_hat).astype(int)) / len(test_x)

        return accuracy

On testing the model using the test data set, we get an accuracy of 99.5%.

The Naive Bayes algorithm is based on Bayes’ rule, which can be written as follows:

$$P(X \mid Y) = \frac{P(Y \mid X)\, P(X)}{P(Y)}$$
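As a quick numeric illustration of Bayes’ rule, written in its standard form P(A | B) = P(B | A) · P(A) / P(B), the sketch below computes the probability that a tweet is positive given that it contains the word “happy”. All counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical counts from a tiny labelled corpus.
total_tweets = 20
positive_tweets = 13
tweets_with_happy = 4
positive_tweets_with_happy = 3

p_positive = positive_tweets / total_tweets                            # P(positive)
p_happy = tweets_with_happy / total_tweets                             # P("happy")
p_happy_given_positive = positive_tweets_with_happy / positive_tweets  # P("happy" | positive)

# Bayes' rule: P(positive | "happy") = P("happy" | positive) * P(positive) / P("happy")
p_positive_given_happy = p_happy_given_positive * p_positive / p_happy
print(round(p_positive_given_happy, 4))  # 0.75
```

The result matches the direct count: 3 of the 4 tweets that contain “happy” are positive.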

Here, the process up to creating the frequency dictionary (importing libraries, preprocessing, etc.) is the same. The algorithm works as follows:

1. Find the log of the ratio of the number of positive tweets to the number of negative tweets, i.e.

$$\text{logprior} = \log P(D_{pos}) - \log P(D_{neg}) = \log D_{pos} - \log D_{neg}$$

2. Instead of keeping the raw frequency of each word under the positive and negative labels, we divide its frequency under a label by the total count of all words under that label. This gives the probability of that word occurring given that the tweet is positive or negative.

3. Then we compute another quantity called the *loglikelihood*. It is the log of the ratio of the positive probability to the negative probability of a particular word. But if the probability of a word is zero (its frequency is zero in either the positive or the negative class), the log becomes +/- infinity. To overcome this we use additive smoothing. The Wikipedia article on additive smoothing explains this in more detail.

Therefore, to compute the positive probability and the negative probability for a specific word in the vocabulary, we’ll use the following inputs:

- $freq_{pos}$ and $freq_{neg}$ are the frequencies of that specific word in the positive and negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label 1.
- $N_{pos}$ and $N_{neg}$ are the total numbers of positive and negative words over all documents (all tweets), respectively.
- $V$ is the number of unique words in the entire set of documents, for all classes, whether positive or negative.

We’ll use these to compute the positive and negative probability for a specific word using these formulas:

$$P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}$$

$$P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}$$

Notice that we add the “+1” in the numerator for additive smoothing. The “+V” in the denominator compensates for it, so that the probabilities of all words in a class still sum to 1.

The loglikelihood can then be written as:

$$\text{loglikelihood} = \log\frac{P(W_{pos})}{P(W_{neg})}$$
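The formulas above can be tried out on a handful of counts. The sketch below is a toy calculation, not the training function itself; the dictionary `toy_freqs` and its counts are hypothetical.

```python
import math

# Hypothetical (word, label) counts; label 1 = positive, 0 = negative.
toy_freqs = {("happi", 1): 3, ("sad", 0): 2, ("day", 1): 1, ("day", 0): 1}

vocab = {word for word, _ in toy_freqs}
V = len(vocab)                                                               # unique words
N_pos = sum(count for (_, label), count in toy_freqs.items() if label == 1)  # positive words
N_neg = sum(count for (_, label), count in toy_freqs.items() if label == 0)  # negative words

loglikelihood = {}
for word in sorted(vocab):
    # additive smoothing: +1 in the numerator, +V in the denominator
    p_w_pos = (toy_freqs.get((word, 1), 0) + 1) / (N_pos + V)
    p_w_neg = (toy_freqs.get((word, 0), 0) + 1) / (N_neg + V)
    loglikelihood[word] = math.log(p_w_pos / p_w_neg)
    print(f"{word:6s} P(w|pos)={p_w_pos:.3f} P(w|neg)={p_w_neg:.3f} "
          f"loglikelihood={loglikelihood[word]:+.3f}")
```

Here “happi” never occurs in a negative tweet and “sad” never occurs in a positive one, yet both get a finite loglikelihood because of the smoothing.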

That’s it! We just need to code the steps above in order to train our Naive Bayes model. So, first, let’s write a function that does all of this work.

    def train_naive_bayes(freqs, train_x, train_y):
        '''
        Input:
            freqs: dictionary from (word, label) to how often the word appears
            train_x: a list of tweets
            train_y: a list of labels corresponding to the tweets (0,1)
        Output:
            logprior: the log prior (log(D_pos) - log(D_neg))
            loglikelihood: a dictionary mapping each word to its log likelihood
        '''
        loglikelihood = {}
        logprior = 0

        # calculate V, the number of unique words in the vocabulary
        vocab = set([pair[0] for pair in freqs.keys()])
        V = len(vocab)

        # calculate N_pos and N_neg
        N_pos = N_neg = 0
        for pair in freqs.keys():
            # if the label is positive (greater than zero)
            if pair[1] > 0:
                # increment the number of positive words by the count for this (word, label) pair
                N_pos += freqs[pair]
            # else, the label is negative
            else:
                # increment the number of negative words by the count for this (word, label) pair
                N_neg += freqs[pair]

        # calculate D, the number of documents
        D = len(train_y)

        # calculate D_pos, the number of positive documents (np.sum returns a scalar)
        D_pos = np.sum(train_y)

        # calculate D_neg, the number of negative documents
        D_neg = D - D_pos

        # calculate logprior
        logprior = np.log(D_pos) - np.log(D_neg)

        # for each word in the vocabulary...
        for word in vocab:
            # get the positive and negative frequency of the word
            freq_pos = freqs.get((word, 1), 0)
            freq_neg = freqs.get((word, 0), 0)

            # calculate the smoothed probability of the word in each class
            p_w_pos = (freq_pos + 1) / (N_pos + V)
            p_w_neg = (freq_neg + 1) / (N_neg + V)

            # calculate the log likelihood of the word
            loglikelihood[word] = np.log(p_w_pos / p_w_neg)

        return logprior, loglikelihood

    logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)

To predict the sentiment of a tweet, we simply sum the loglikelihoods of the words in the tweet and add the logprior. If the result is positive, the tweet shows positive sentiment; if it is negative, the tweet shows negative sentiment.
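The decision rule is just a sum and a sign check, as the small sketch below shows. The scores in `toy_loglikelihood`, the zero `toy_logprior` and the helper `toy_score` are hypothetical values chosen for illustration, not trained parameters.

```python
# Hypothetical values, in the same format as a trained logprior and loglikelihood.
toy_logprior = 0.0
toy_loglikelihood = {"happi": 1.2, "great": 0.9, "sad": -1.3, "day": -0.1}


def toy_score(words, logprior, loglikelihood):
    """Sum the logprior and the loglikelihood of every known word."""
    return logprior + sum(loglikelihood.get(word, 0.0) for word in words)


for words in (["great", "day"], ["sad", "day"], ["unknown"]):
    score = toy_score(words, toy_logprior, toy_loglikelihood)
    label = "positive" if score > 0 else "negative"
    print(words, round(score, 2), label)
```

Words that are not in the dictionary contribute nothing, so a tweet made up only of unseen words is decided by the logprior alone.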

So let’s write a predicting function (which takes in a tweet, the loglikelihood and the logprior, and returns the prediction) and a testing function (which tests the model on the test data set).

    def naive_bayes_predict(tweet, logprior, loglikelihood):
        '''
        Input:
            tweet: a string
            logprior: a number
            loglikelihood: a dictionary of words mapping to numbers
        Output:
            p: the sum of the loglikelihoods of each word in the tweet
               (if found in the dictionary) + logprior (a number)
        '''
        # process the tweet to get a list of words
        word_l = process_tweet(tweet)

        # initialize the score to zero
        p = 0

        # add the logprior
        p += logprior

        for word in word_l:
            # check if the word exists in the loglikelihood dictionary
            if word in loglikelihood:
                # add the log likelihood of that word to the score
                p += loglikelihood[word]

        return p

    def test_naive_bayes(test_x, test_y, logprior, loglikelihood):
        """
        Input:
            test_x: a list of tweets
            test_y: the corresponding labels for the list of tweets
            logprior: the logprior
            loglikelihood: a dictionary with the loglikelihoods for each word
        Output:
            accuracy: (# of tweets classified correctly)/(total # of tweets)
        """
        y_hats = []
        for tweet in test_x:
            # if the prediction is > 0
            if naive_bayes_predict(tweet, logprior, loglikelihood) > 0:
                # the predicted class is 1
                y_hat_i = 1
            else:
                # otherwise the predicted class is 0
                y_hat_i = 0

            # append the predicted class to the list y_hats
            y_hats.append(y_hat_i)

        # error is the average of the absolute differences between y_hats and test_y;
        # both are flattened to one dimension so the subtraction is element-wise
        error = np.mean(np.absolute(np.array(y_hats) - np.squeeze(test_y)))

        # accuracy is 1 minus the error
        accuracy = 1 - error

        return accuracy

On testing the model on the test data set, we get an accuracy of 99.4%, which is slightly lower, possibly because of the assumptions that the Naive Bayes algorithm makes. In fact, it is called “Naive” because of these assumptions.

The assumptions are as follows:

1. Independence assumption

The words “sunny” and “hot” tend to depend on each other and are correlated to a certain extent with the word “desert”, but Naive Bayes assumes independence throughout. Furthermore, if you were asked to fill in a blank in a sentence, this naive model would assign equal weight to the words “spring”, “summer”, “fall” and “winter”, whatever the surrounding context.

2. Relative frequencies

On Twitter, there are usually more positive tweets than negative ones. However, some “clean” data sets you may find are artificially balanced to have the same number of positive and negative tweets. Keep in mind that in the real world the data could be much noisier.
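The class balance enters the Naive Bayes model through the logprior. The sketch below compares a balanced split of 4000 positive and 4000 negative tweets, as used for training in this article, with a hypothetical imbalanced one; the helper name `log_prior` and the 6000/2000 split are assumptions for illustration.

```python
import math


def log_prior(num_positive, num_negative):
    """log(D_pos) - log(D_neg) for the given numbers of positive and negative tweets."""
    return math.log(num_positive) - math.log(num_negative)


print(log_prior(4000, 4000))            # balanced training set: 0.0
print(round(log_prior(6000, 2000), 4))  # hypothetical imbalanced set: 1.0986
```

With a balanced data set the logprior is zero and has no effect on the prediction, while an imbalanced set shifts every score towards the majority class.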
