Build SMS Spam Classification Model using Naive Bayes & Random Forest | by Dhaval Thakur | Nov, 2020


Dhaval Thakur

If you are into data science and looking for starter projects, the SMS spam classification project is one you should work on! In this tutorial, we will go step by step from importing libraries to full model prediction and, finally, measuring the accuracy of the model.

Image by Dhaval (drawn on iPad)

A good text classifier is one that efficiently categorizes large sets of text documents in a reasonable time frame and with acceptable accuracy, and that provides classification rules that are human-readable for possible fine-tuning. If the classifier also trains quickly, that can be a real asset in some application domains. Many techniques and algorithms for automatic text categorization have been devised.

The text classification task can be defined as assigning category labels to new documents based on the knowledge gained by a classification system at the training stage. In the training phase, we are given a set of documents with class labels attached, and a classification system is built using a learning method. Classification is an important task in both the data mining and machine learning communities; however, most of the learning approaches in text categorization come from machine learning research.

For this project, I will be using Google Colab, but you can also use a Python notebook for the same purpose.

Importing of Libraries

First, we will import the required libraries, such as pandas, matplotlib, numpy, and sklearn.

Note: the last line of the code snippet can be removed if you are not using Google Colab. That line mounts my Google Drive in Google Colab so that I can use the dataset stored in my Drive.
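As a minimal sketch of what the import step could look like: the exact list of sklearn modules is an assumption based on the steps used later in this tutorial, and the Drive mount is wrapped in a guard (also an assumption) so the same cell runs outside Colab.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# Only relevant on Google Colab: mount Google Drive so a dataset stored there can be read.
# Outside Colab the google.colab package is not available, so the mount is skipped.
try:
    from google.colab import drive
except ImportError:
    drive = None

if drive is not None:
    drive.mount("/content/drive")
```

If you always work on Colab, the guard can be reduced to the plain import and the `drive.mount(...)` call.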

Importing the dataset

I will be using the dataset from my GitHub repo, which can be found here.

After downloading the dataset, we will import it using pandas’ read_csv function.

Note: Please use your own path for the dataset.

Now that we have imported the dataset, let’s check whether it was imported in the correct format by using the head() function.
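The loading step can be sketched as follows. Since your file path will differ, this example reads a tiny in-memory stand-in instead of the real file; the four rows are made up for illustration, and the column layout (v1, v2, plus unnamed columns) is assumed from the description in this tutorial.

```python
import io
import pandas as pd

# Stand-in for the real CSV file: made-up rows in the layout described in the text.
# With the real dataset you would call pd.read_csv(<your own path>) instead.
raw_csv = io.StringIO(
    "v1,v2,,,\n"
    "ham,Are we still meeting for lunch today?,,,\n"
    "spam,Win a free prize now! Call to claim,,,\n"
    "ham,I will be late tonight,,,\n"
    "ham,Thanks for the help yesterday,,,\n"
)

data = pd.read_csv(raw_csv)

# head() shows the first rows so the column layout can be checked by eye.
print(data.head())
```

pandas names header cells that are empty "Unnamed: N", which is where the extra columns come from.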

From the above dataset snippet, I see that we have columns (the unnamed ones) that we do not require! Thus now comes the task of cleaning and reformatting the data so that we can use it to build our model.

Data Cleaning & Exploration

Now we have to remove the unnamed columns. To do so, we will use the drop function.

Now, the next task is to rename the columns v1 and v2 to label and message respectively!
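A minimal sketch of the two cleaning steps, on a small made-up DataFrame with the same column layout. Selecting the columns to drop by their "Unnamed" prefix is one possible approach, not necessarily the one used originally.

```python
import pandas as pd

# Made-up rows with the same layout as the raw dataset.
data = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["See you at lunch", "Win a free prize now", "Call me later"],
    "Unnamed: 2": [None, None, None],
    "Unnamed: 3": [None, None, None],
    "Unnamed: 4": [None, None, None],
})

# Drop every column whose name starts with "Unnamed".
unnamed_columns = [col for col in data.columns if col.startswith("Unnamed")]
data = data.drop(columns=unnamed_columns)

# Rename v1 -> label and v2 -> message.
data = data.rename(columns={"v1": "label", "v2": "message"})

print(data.head())
```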

Now, on to some data exploration (it’s an optional step, but it’s always good to do some 😛).

Next, we want to know how many messages are ham and how many are spam in our dataset. For that:

Explanation: Here we set sort=True and use the value_counts method of pandas. This code makes a bar plot with green and red bars for the spam and not-spam classes.
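A sketch of such a class-count plot, on made-up labels. The plot title and the `Agg` backend line are assumptions added so the example also runs without a display; in a notebook you can drop the backend line.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up labels: three ham messages and one spam message.
data = pd.DataFrame({"label": ["ham", "ham", "spam", "ham"]})

# sort=True orders the classes by frequency, most common first.
counts = data["label"].value_counts(sort=True)

# One bar per class; the colors are assigned in the same order as the bars.
ax = counts.plot(kind="bar", color=["green", "red"])
ax.set_title("Messages per class")  # hypothetical title
ax.set_ylabel("count")

print(counts)
```

Because the bars follow the sorted counts, the first color goes to whichever class is more frequent (ham in this toy data).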

The output you get should look similar to this:

We see that we have a lot of ham messages and far fewer spam messages. In this tutorial, we will go ahead with this dataset as it is, without augmenting it (no oversampling or undersampling).

So first let me encode spam and not-spam messages as 1 and 0 respectively.

Now, the second line of the above code snippet uses sklearn’s split method (train_test_split) to split the data into training and testing datasets. Here I have set the test data size to 70 percent of the whole dataset. (You can change it according to your needs.)
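The encoding and the split can be sketched in two statements. The messages are made up, `test_size=0.70` follows the 70 percent figure stated above, and `random_state=42` is an assumption added only to make the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up messages standing in for the cleaned dataset.
data = pd.DataFrame({
    "label": ["ham", "spam", "ham", "ham", "spam",
              "ham", "ham", "spam", "ham", "ham"],
    "message": [
        "see you at lunch", "win a free prize now", "call me when you get home",
        "are we meeting today", "claim your free cash reward",
        "thanks for the help", "running late tonight", "urgent call now to win cash",
        "can you send the notes", "happy birthday to you",
    ],
})

# Line 1: encode spam as 1 and ham (not spam) as 0.
data["label"] = data["label"].map({"ham": 0, "spam": 1})

# Line 2: hold out 70 percent of the rows for testing, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    data["message"], data["label"], test_size=0.70, random_state=42
)

print(len(X_train), "training rows,", len(X_test), "test rows")
```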


Now I will be using the Multinomial Naive Bayes algorithm!

As you can see, I have also included recall and precision tests to assess more accurately how well my model is performing.
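A minimal sketch of this step on made-up messages. Turning the text into word counts with `CountVectorizer` is an assumption, since the tutorial does not say how the messages are vectorized; `alpha=1.0`, `stratify`, and `random_state` are likewise illustrative choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Made-up toy corpus: label 0 = ham, label 1 = spam.
ham = ["see you at lunch", "call me when you get home", "are we meeting today",
       "thanks for the help yesterday", "running late tonight",
       "can you send me the notes"] * 2
spam = ["win a free prize now", "claim your free cash reward",
        "urgent call now to win cash", "free entry to win a prize"] * 2
messages = ham + spam
labels = [0] * len(ham) + [1] * len(spam)

X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.70, random_state=42, stratify=labels
)

# Assumed vectorization step: bag-of-words counts learned on the training set only.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

model = MultinomialNB(alpha=1.0)
model.fit(X_train_counts, y_train)
predictions = model.predict(X_test_counts)

test_accuracy = metrics.accuracy_score(y_test, predictions)
test_recall = metrics.recall_score(y_test, predictions, zero_division=0)
test_precision = metrics.precision_score(y_test, predictions, zero_division=0)
print(test_accuracy, test_recall, test_precision)
```

On a corpus this small the scores say little; the point is only the order of the calls.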

Now, for different values of alpha, I will make a table to see various measures such as Train Accuracy, Test Accuracy, Test Recall, and Test Precision.

Now we have to find the best index for Test Precision, as that is the measure I am most concerned about here (high precision means few legitimate messages get flagged as spam). Note that precision is not always the right metric for evaluating a model. It always depends on your use case!
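One way to build such a table and pick the row with the best test precision is sketched below. The alpha grid, the toy messages, and the `CountVectorizer` step are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Made-up toy corpus: label 0 = ham, label 1 = spam.
ham = ["see you at lunch", "call me when you get home", "are we meeting today",
       "thanks for the help yesterday", "running late tonight",
       "can you send me the notes"] * 2
spam = ["win a free prize now", "claim your free cash reward",
        "urgent call now to win cash", "free entry to win a prize"] * 2
messages = ham + spam
labels = [0] * len(ham) + [1] * len(spam)

X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.70, random_state=42, stratify=labels
)
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# Hypothetical grid of smoothing values to compare.
rows = []
for alpha in [0.1, 0.5, 1.0, 2.0]:
    model = MultinomialNB(alpha=alpha).fit(X_train_counts, y_train)
    test_pred = model.predict(X_test_counts)
    rows.append({
        "alpha": alpha,
        "Train Accuracy": model.score(X_train_counts, y_train),
        "Test Accuracy": metrics.accuracy_score(y_test, test_pred),
        "Test Recall": metrics.recall_score(y_test, test_pred, zero_division=0),
        "Test Precision": metrics.precision_score(y_test, test_pred, zero_division=0),
    })

results = pd.DataFrame(rows)

# Index of the row with the highest test precision (first one if there are ties).
best_index = results["Test Precision"].idxmax()
best_alpha = results.loc[best_index, "alpha"]
print(results)
print("best alpha by test precision:", best_alpha)
```

If several alphas tie on precision, `idxmax` returns the first, so you may want to break ties on another column such as Test Accuracy.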

I will be using the RandomForestClassifier with n_estimators set to 100 (you can change this as you like to get optimal results).

In the above code snippet, in the last line I fit my model with X_train and y_train.

Now, let’s look at the predictions. I will be using the predict function and also calculating the precision, recall, F-score, and accuracy measures.
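The random forest steps (create the classifier, fit it, predict, score) can be sketched as follows. As before, the toy messages, the `CountVectorizer` step, and `random_state` are assumptions; only `n_estimators=100` and fitting on `X_train` and `y_train` come from the text.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Made-up toy corpus: label 0 = ham, label 1 = spam.
ham = ["see you at lunch", "call me when you get home", "are we meeting today",
       "thanks for the help yesterday", "running late tonight",
       "can you send me the notes"] * 2
spam = ["win a free prize now", "claim your free cash reward",
        "urgent call now to win cash", "free entry to win a prize"] * 2
messages = ham + spam
labels = [0] * len(ham) + [1] * len(spam)

X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.70, random_state=42, stratify=labels
)
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# 100 trees, as in the text; random_state only makes the run repeatable.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_counts, y_train)

predictions = rf.predict(X_test_counts)

# Precision, recall and F-score for the spam class (label 1), plus overall accuracy.
precision, recall, fscore, _ = metrics.precision_recall_fscore_support(
    y_test, predictions, average="binary", zero_division=0
)
accuracy = metrics.accuracy_score(y_test, predictions)
print(precision, recall, fscore, accuracy)
```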

Model Evaluation

Thus we see that our model’s accuracy is approximately 96 percent, which I think is pretty decent. Its precision value is also close to 1, again a decent value.

In my next article, I will use NLP and a neural network and explain how we can get a more accurate model!

If you liked this tutorial, please do share it with your friends or on social media!

Want to have a chat about data science? Ping me on LinkedIn!

