 ## Stochastic gradient descent implementation for SoftSVM | by Jaime Dantas | Dec, 2020

We will use a 4-dimensional dataset with 1,372 data points for this classification. This dataset is the banknote authentication dataset from the UCI repository. All source codes used in this tutorial are available in the repository below.

After downloading the data_banknote_authentication.txt file, we’ll import it on MATLAB in Home > Import Data. We need to import them as a Numeric Matrix as shown below.

Note that the last column of this dataset represents the class 0 or 1. We’ll replace this label with +1 and -1 later on.

Stochastic gradient descent is an interactive method used in machine learning for optimization problems. Unlike the gradient descent (GD) alternative, SGD uses random data points to calculate the direction of the gradient on each interaction. On expectation, the SGS converges to a minimum of the convex.

There are several advantages when using SGD against GD, but we won’t dive deep into the details of these two machine learning methods. If you want to know more about it, this article can help.

Support Vector Machines (SVM) is a linear classifier that can be viewed as a similar algorithm to the Perceptron. There are two types of SVM: The hard margin and the soft margin SVM. The former is used when we have a linearly separable dataset, and the latter is for not linearly separable data. For SoftSVM, we allow ourselves to have some slack so we can create a linear classifier even if the data is not linearly separable. The figure below shows a dataset where both the HardSVM and the SoftSVM are identical.

Enough said, let’s implement the algorithm! We will implement a general linear classifier by adding a column of 1 to the design matrix. During the optimization, we’ll keep track of both the empirical binary loss the empirical hinge loss of the current weight vector w, and we’ll output the last w vector as our predictor.

First, I loaded the dataset into MATLAB. Then, I created a function that receives the x, t and λ. The number of updates T is 500 in this implementation. Before doing any operation with the dataset, I shuffle it to eliminate any sort of bias that may exist (even though for SGD in SoftSVM it does not matter since it guarantees randomness). I then add a column of ones to the matrix x and initialize with θ = 0. The algorithm is shown below .