Stochastic gradient descent implementation for SoftSVM | by Jaime Dantas | Dec, 2020


We will use a 4-dimensional dataset with 1,372 data points for this classification. This dataset is the banknote authentication dataset from the UCI repository. All the source code used in this tutorial is available in the repository below.

After downloading the data_banknote_authentication.txt file, we’ll import it into MATLAB via Home > Import Data. We need to import it as a Numeric Matrix, as shown below.

MATLAB import window

Note that the last column of this dataset contains the class label, 0 or 1. We’ll replace these labels with +1 and -1 later on.
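The relabelling is a one-liner. Here is a Python sketch (the article’s code is in MATLAB, and the array below is illustrative, not the actual dataset):

```python
import numpy as np

# Map {0, 1} class labels to {-1, +1} for the SVM formulation.
labels = np.array([0, 1, 1, 0])  # illustrative labels, not the real data
t = 2 * labels - 1               # 0 -> -1, 1 -> +1
print(t)  # [-1  1  1 -1]
```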

Stochastic gradient descent is an iterative method used in machine learning for optimization problems. Unlike the gradient descent (GD) alternative, SGD uses a random data point to calculate the direction of the gradient on each iteration. In expectation, SGD converges to a minimum of a convex objective.
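As a minimal sketch of this idea (in Python rather than MATLAB, and on a toy least-squares objective rather than the SVM one), a single SGD iteration replaces the full-batch gradient with the gradient at one randomly chosen point:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_step(w, x, t, step):
    i = rng.integers(len(t))          # pick one random data point
    grad = (x[i] * w - t[i]) * x[i]   # gradient of 0.5 * (x_i * w - t_i)^2
    return w - step * grad

# Toy 1-D least squares: the minimizer of the expected loss is w = 3.
x = np.ones(100)
t = np.full(100, 3.0)
w = 0.0
for _ in range(500):
    w = sgd_step(w, x, t, 0.1)
print(round(w, 3))  # close to 3.0
```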

There are several advantages to using SGD over GD, but we won’t dive deep into the details of these two methods here. If you want to know more, this article can help.

The Support Vector Machine (SVM) is a linear classifier that can be viewed as an algorithm similar to the Perceptron. There are two types of SVM: hard margin and soft margin. The former is used when the dataset is linearly separable, and the latter when it is not. For SoftSVM, we allow ourselves some slack so we can build a linear classifier even if the data is not linearly separable. The figure below shows a dataset where the HardSVM and the SoftSVM are identical.
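To make the two losses we will track concrete, here is a small Python sketch (the data and weight vector are illustrative) of the hinge loss that SoftSVM minimizes and the binary (0-1) loss we ultimately care about. The slack in SoftSVM corresponds to points with positive hinge loss:

```python
import numpy as np

def hinge_loss(w, X, t):
    # mean of max(0, 1 - t_i * <w, x_i>): penalizes margin violations
    return float(np.mean(np.maximum(0.0, 1.0 - t * (X @ w))))

def binary_loss(w, X, t):
    # fraction of misclassified points
    return float(np.mean(np.sign(X @ w) != t))

X = np.array([[2.0], [-2.0], [0.5]])
t = np.array([1.0, -1.0, -1.0])  # the last point violates the margin
w = np.array([1.0])
print(hinge_loss(w, X, t))   # (0 + 0 + 1.5) / 3 = 0.5
print(binary_loss(w, X, t))  # 1 of 3 points misclassified
```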

The case where the SoftSVM is equal to the HardSVM. Image by Author.

Enough said, let’s implement the algorithm! We will implement a general linear classifier by adding a column of ones to the design matrix. During the optimization, we’ll keep track of both the empirical binary loss and the empirical hinge loss of the current weight vector w, and we’ll output the last w vector as our predictor.

First, I loaded the dataset into MATLAB. Then, I created a function that receives x, t and λ. The number of updates T is 500 in this implementation. Before doing any operation on the dataset, I shuffle it to eliminate any sort of bias that may exist (even though for SGD in SoftSVM this does not matter, since the algorithm already samples at random). I then add a column of ones to the matrix x and initialize θ = 0. The algorithm is shown below [1].
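For readers without MATLAB, the loop just described can be sketched in Python. The function name and signature are mine, and the update rule w(k) = θ/(λk) with a subgradient step on margin violations follows the SGD-for-SoftSVM algorithm in [1]; this is an illustrative sketch, not the article’s actual code:

```python
import numpy as np

def soft_svm_sgd(X, t, lam, T=500, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X])   # column of ones -> bias term
    theta = np.zeros(Xb.shape[1])
    w = theta.copy()
    binary, hinge = [], []
    for k in range(1, T + 1):
        w = theta / (lam * k)              # w^(k) = theta / (lambda * k)
        i = rng.integers(n)                # one random data point
        if t[i] * (Xb[i] @ w) < 1:         # margin violated: subgradient step
            theta = theta + t[i] * Xb[i]
        scores = Xb @ w
        binary.append(float(np.mean(np.sign(scores) != t)))
        hinge.append(float(np.mean(np.maximum(0.0, 1.0 - t * scores))))
    return w, binary, hinge                # last w plus both loss histories
```

The two returned lists play the role of the cell arrays in the MATLAB version.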

Our algorithm will output the last vector w, as well as one cell array with the binary loss and another with the hinge loss at each iteration Ti.


We’ll run our algorithm against the dataset with various values for the regularization parameter λ = {100, 10, 1, 0.1, 0.01, 0.001}. I created the run_softSVM.m script to run the SoftSVM algorithm for every λ.

At the end of run_softSVM.m, I output the value and position of the smallest binary loss out of the 500 iterations for each chosen λ. The figures below show the charts for the binary loss with λ = 1 and 0.001, and the hinge loss for λ = 100.
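That bookkeeping amounts to an argmin over each recorded loss curve. A Python sketch of the step (with random numbers standing in for the real loss histories, which would come from the SoftSVM runs):

```python
import numpy as np

lambdas = [100, 10, 1, 0.1, 0.01, 0.001]

# Placeholder loss curves; in run_softSVM.m these come from the SGD runs.
rng = np.random.default_rng(1)
binary_losses = {lam: rng.random(500) for lam in lambdas}

for lam in lambdas:
    losses = binary_losses[lam]
    k = int(np.argmin(losses))  # position of the smallest binary loss
    print(f"lambda={lam}: min binary loss {losses[k]:.3f} at iteration {k + 1}")
```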

Binary loss for λ = 0.001 and λ = 1, and hinge loss for λ = 100, in that order

Since we update the gradient and calculate the loss based on a random data point at each iteration, we expect to see some noise in all curves. Below we can see the minimum binary loss for each value of λ we ran our algorithm with.

That being shown, let’s first analyze the binary loss for λ = 0.001. Even though the curve is not strictly monotone, it is approximately monotone in this case: it has a decreasing tendency. This becomes apparent when we draw a decreasing trend curve for this graph, as shown in the figure below.

The binary loss for λ = 0.001 with a decreasing trend curve

In other words, the larger the number of iterations, the smaller the binary loss. However, after iteration 384, we start to bounce around the minimum and get slightly larger losses. The binary loss for λ = 1 showed a similar behaviour, also with a decreasing tendency. This time, however, we reached the minimum binary loss at iteration 367, so the algorithm converged quicker for λ = 1 in this case. In addition, we get a slightly smaller binary loss with λ = 1 than with λ = 0.001. So, if we used the w from the last iteration as our classifier in either of these cases, we would be fine.

Moving on to the hinge loss for λ = 100, we can clearly see that it is not monotone, since it begins to increase at around iteration 150. In fact, if we picked the w from the last iteration this time, we would have an error of much larger magnitude compared with the best predictor, because the minimum hinge loss in this case is at iteration 83. Also, the minimum hinge loss for λ = 100 was larger than in the other cases (λ = {10, 1, 0.1, 0.001}).

Therefore, the value of λ plays an important role in both the number of iterations the algorithm takes to converge and the minimum binary loss. From this analysis, we saw that choosing a larger λ, say 100 or 10, can lead to a faster execution time at the cost of a slightly larger loss. Conversely, very small values of λ, say 0.1, may require more iterations to converge but can give a smaller loss. This suggests that we should test different values of λ when tuning SoftSVM with stochastic gradient descent, in order to find the parameter that achieves the minimum binary loss in the fewest iterations (to save precious computation time).

We can even combine different values of λ in our algorithm: run it with a large λ for the first iterations, then switch to a small λ for the remaining iterations to converge faster. Even though the binary loss is a great indicator that we are converging, it is important to use surrogate functions such as the hinge loss to analyze whether or not the predicted vector w gives us a good classifier.
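That two-phase idea can be sketched as a simple schedule. The switch point and the two λ values below are illustrative, not tuned:

```python
def lambda_schedule(k, T1=100, lam_big=10.0, lam_small=0.01):
    """Return the regularizer for iteration k: a large lambda for the first
    T1 iterations (fast early progress), then a small one (finer steps)."""
    return lam_big if k <= T1 else lam_small

# The SGD loop would then call lambda_schedule(k) instead of a fixed lambda.
print(lambda_schedule(50))   # early phase: 10.0
print(lambda_schedule(200))  # late phase: 0.01
```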

We saw that SGD converges in expectation even though it takes random steps towards the minimum at each iteration. We implemented SGD for SoftSVM in MATLAB and analyzed the impact of different values of the regularizer λ.

Choosing the ideal value of λ proved to be quite challenging, and it required a detailed analysis of several values of λ. We found that we can even combine distinct values of the regularizer during different parts of the run to improve performance.

I hope you have learnt a bit more about the importance of the regularizer on SGD in SoftSVM.

I’m an M.A.Sc. student at York University and a Software Engineer at heart. Over the past decade, I’ve worked in several industries in areas such as software development, cloud computing and systems engineering. Currently, I’m doing research on cloud computing and distributed systems.

You can check my work on my website if you want to.

Thanks for reading!

[1] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. DOI: 10.1017/CBO9781107298019.
