Explain by Example: Deep Learning (NN) | by Michelle Xie | Nov, 2020
I was at an airport recently, a statement I didn’t think I would be able to make for another year or so. And I don’t know about you guys but I feel like I have been cursed by the airport gods because, for some reason, I can never get past baggage check-in without overweight bags. And that got me thinking: what if I could use artificial intelligence to pack my bags for me? What if it could help me optimize the type of items I want in my luggage whilst also managing the weight of each item and the value or importance of each item to me? So I started digging around and decided to read up on deep learning…
What is deep learning?
For the longest time, I had trouble figuring out whether deep learning is part of machine learning, whether machine learning is part of deep learning, or whether they are two distinct areas of AI. (Strictly speaking, deep learning is a subset of machine learning, but an analogy helps me see how the two differ in approach.) The best analogy I can come up with is to think of machine learning as the field of psychology and deep learning as the field of neuroscience. Both psychology and neuroscience look into how the brain operates. Psychology focuses more on conducting social experiments and observing human behaviors and interactions to explain proposed theories and hypotheses, while neuroscience taps into the biology of the brain to understand how it works at the biological level. If we apply that to machine learning vs. deep learning, both are methods or techniques for creating intelligent machines or intelligent software. Machine learning focuses more on applying well-known statistical methods to data to create predictive models, while deep learning focuses more on simulating the decision-making process of the human brain to create machine models that make human-like decisions.
Don’t worry, I’ll try not to flex too much big, fancy technical terminology. After all, my vocabulary (at best) is still quite limited, especially in this field.
Anyway, as I was saying, deep learning is really about creating artificial neural networks that simulate the way our brains work. These artificial neural networks mimic the way neurons in our brains get triggered and fire off other neurons across synapses to control some of our movements, react to something, or even make a decision. Which brings me to…
What are neural networks?
You mean these scary looking diagrams here? They’re actually not that scary once you break them down, but they have proven to be quite powerful and are the foundation of a lot of today’s really powerful and complex artificial intelligence technologies, like computer vision and natural language processing, which have led to bringing science fiction like self-driving cars to reality.
Let’s break down a neural network then, shall we?
Neural networks are made up of layers and layers of neurons.
There are 3 types of layers you need to know about: the input layer, the hidden layers and the output layer.
The input layer allows you to match the features of your data as input into the neural network, the output layer allows you to determine the number of labels or classes you want your neural network to predict or output and then everything in between is…
You guessed it, the hidden stuff.
Now the scary diagram I started off with is known as a fully connected neural net because as you can see, each neuron (represented by one of these colored circles or nodes) in each layer is connected to each of the other neurons in the next layer over.
Ok, let’s go back to my overweight baggage problem and replace these nodes (or neurons) with something from my suitcase:
Let’s say I have some data features which represent my suitcases e.g. the size of the suitcase, the make of the suitcase, how long the suitcase will be in transit for, what airline I’m flying with, and so on. So I’m going to map each of these features to an input node in the input layer. And let’s say that my hidden layers are made up of clothes that together will form an outfit e.g. I need a top, a bottom, and some shoes. I’ll call the first hidden layer the top layer, the second hidden layer the bottom layer, and the third hidden layer the shoe layer. And then finally, I have the output layer, which represents the outputs I want from my neural network model. In this case, it just tells me whether I should pack or not pack a particular combination of clothes into my suitcase to take with me to the airport. Now let’s connect these layers together:
If you have ever picked an outfit before, you know that you can’t just pick out the top, the pants and the shoes all in one go whether you care about fashion or not. You typically start with one item, pick out the next item, the next item after that and so on. Sometimes you might even pick out two or more items until you’ve finally decided what to wear. The same goes for a neural network. You start at the first layer, you pick some nodes or neurons to activate, sometimes you could activate a couple of them and then you move onto the next layer to activate the nodes or neurons in the next layer and this continues until you reach the output layer. This is called forward propagation. You’re forwarding some of the choices you’ve made in the previous layers onto the next layer to help the neural net decide which nodes to fire off next based on some of the intermediate decisions you’ve already made.
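To make forward propagation a little more concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the network shape (2 inputs, 2 hidden neurons, 1 output), the weights, the biases and the feature values are all assumptions, and it uses the sigmoid activation that comes up later in this post. It only shows the idea of handing each layer’s outputs to the next layer.

```python
import math

def sigmoid(z):
    """Squash any real number into the range 0 to 1 (written to avoid overflow)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def layer_forward(inputs, weights, biases):
    """One fully connected layer: every neuron looks at every input."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(sigmoid(z))
    return outputs

def forward(inputs, layers):
    """Pass the inputs through each layer in order, front to back."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

# Hypothetical network: 2 input features -> 2 hidden neurons -> 1 output neuron.
# Each layer is (weights, biases), with one row of weights per neuron.
layers = [
    ([[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1]),  # hidden layer
    ([[0.3, -0.6]], [0.05]),                  # output layer
]
suitcase_features = [1.0, 2.0]  # made-up, already-scaled feature values
prediction = forward(suitcase_features, layers)
print(prediction)  # a single number between 0 and 1: pack or don't pack
```

The loop in `forward` is the whole trick: whatever one layer decides becomes the input to the next layer.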
Now let’s say for example, after one iteration, our neural net activates the neurons in our net in this manner:
So our neural net has told us to pack the combination of clothes inside suitcase #1. But then we get to the airport and find that our baggage is still overweight, or it packed illegal contents into the bag, or even worse…it packed mismatching outfits for us. What a disaster for our Instagram!
So basically, our model got it wrong.
What can we do about it?
Well, to ensure that our neural net can help us pack the right suitcase, we will need to run through several iterations to train the neural network and during each iteration, we might need to tweak the weights and bias of the neurons in our model.
Weights and Bias?
Think of the weights and bias as the importance and preference for an item in your suitcase. Let’s say to pack your bags for the top layer you have the option of picking from 5 different t-shirts, 5 different singlets, and 5 different long sleeve shirts. Now how do you decide whether you need t-shirts, singlets, or long sleeve shirts? You’re probably going to check the destination you’re going to. If the destination is going to be hot, you’ll probably prioritize t-shirts and singlets. If the destination is cold, you’ll probably prioritize long sleeve shirts. In the same way, you’ll need to train your neural net on what it needs to prioritize for you by using weights. Now, even amongst the t-shirts, singlets, and long sleeves, you probably have some sort of preference for one top over another because let’s be honest, everyone has favorites. In that case, you might want to use the bias to adjust the model to favor the nodes that contain your favorite items.
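Here is a small sketch of what weights and a bias do in numbers, assuming the usual weighted-sum form (each feature times its weight, plus the bias). The features, weights and bias values are all invented for the example and are not taken from any real model.

```python
def linear_score(features, weights, bias):
    """Weighted sum of the features plus a bias term."""
    return sum(w * x for w, x in zip(weights, features)) + bias

# Hypothetical features: [how hot the destination is, how cold it is], scaled 0 to 1.
hot_destination = [0.9, 0.1]

# Made-up weights: t-shirts care about heat, long sleeves care about cold.
tshirt_weights = [2.0, -1.0]
long_sleeve_weights = [-1.0, 2.0]

# Made-up bias: a small nudge towards a favorite t-shirt.
tshirt_score = linear_score(hot_destination, tshirt_weights, bias=0.5)
long_sleeve_score = linear_score(hot_destination, long_sleeve_weights, bias=0.0)

print(tshirt_score > long_sleeve_score)  # True: the t-shirt wins for a hot destination
```

The weights decide how much each feature matters, and the bias shifts the score up or down regardless of the features.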
How do we activate these neurons or nodes?
After running out of kdrama to cry over, I decided to step it up with some fun maths from this lecture and figured out that a neuron is actually made up of a linear part and an activation part which we can represent as:
neuron = linear + activation
The linear part combines 3 things: the feature, the weight and the bias (the feature multiplied by the weight, plus the bias). What really gets the neuron all fired up is the activation part, which takes the result of the linear part and passes it through an activation function.
Let’s take our first hidden layer as an example:
There are only two nodes or neurons in this layer e.g. t-shirt A and t-shirt B. Now to decide if we want to fire off t-shirt A, we need to look at the value returned by the activation function (y1) which will then determine the overall value of that neuron. The same goes for t-shirt B (y2).
Now that σ symbol which looks like an ‘o’ with a cool hair style represents sigmoid which is a function that by nature will be able to take any number from negative infinity (-∞) to infinity ( ∞) and map it to some value between 0 and 1 which I think is pretty cool. But it does more than just being cool, it means we can take any number and represent it with a standard, finite range of values between 0 and 1 if we ignore the fact that there are actually small infinities within the range of 0 and 1 but that’s not the point here. The point here is that now we can establish some boundaries like 0.5 for example so if our y1 returns anything in the range of 0.5 to 1, we fire off the t-shirt A neuron. And if y1 returns anything in the range of 0 to 0.4999…, we don’t fire off the t-shirt A neuron.
Same goes for t-shirt B. So our little ‘o’ with a cool hairdo (σ) is great for binary classification problems like to activate or to not activate.
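Here is a minimal sketch of a single neuron with a sigmoid activation and a 0.5 cut-off. It assumes the common form y = σ(weight × feature + bias); the helper name `neuron_fires` and the numbers for t-shirt A and t-shirt B are made up for illustration.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1 (written to avoid overflow)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def neuron_fires(feature, weight, bias, threshold=0.5):
    """Return the neuron's output and whether it crosses the threshold."""
    z = weight * feature + bias  # linear part
    y = sigmoid(z)               # activation part
    return y, y >= threshold

# Hypothetical values for the two t-shirt neurons.
y1, fire_a = neuron_fires(feature=1.0, weight=2.0, bias=-1.0)
y2, fire_b = neuron_fires(feature=1.0, weight=-2.0, bias=0.5)

print(fire_a, fire_b)  # True False
```

With these made-up numbers, t-shirt A gets a positive input to the sigmoid, so it lands above 0.5 and fires. T-shirt B gets a negative input, so it stays below 0.5 and doesn’t.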
Ok, so we can tweak the weights and bias in the neurons of our model but how do we determine what values to tweak them to?
Introducing loss functions
Without some help, you and I will probably be equally lost when it comes to picking out what weights and bias to choose for our model, so we need something called a loss function (or cost function) to help us out. The loss function tells us how far off our model’s predicted values were compared to the actual values. To find this, we might compute the absolute difference between the actual value (y) and the predicted value (ŷ), or sum the squared differences between the two values across all our examples:
loss = | actual - predicted | = | y - ŷ | or ∑( y - ŷ )²
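Both versions of the loss are only a few lines of code. This is just a sketch: the function names and the actual and predicted values below are made up for illustration.

```python
def absolute_error(actual, predicted):
    """| y - ŷ | for a single prediction."""
    return abs(actual - predicted)

def sum_squared_error(actuals, predictions):
    """∑( y - ŷ )² across a list of predictions."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actuals, predictions))

# Made-up example: 1.0 means "pack it", 0.0 means "leave it at home".
actuals = [1.0, 0.0, 1.0]
predictions = [0.9, 0.2, 0.4]

single_loss = absolute_error(actuals[0], predictions[0])
total_loss = sum_squared_error(actuals, predictions)
print(single_loss, total_loss)
```

Notice that squaring punishes the big miss (1.0 vs. 0.4) much more than the small ones, which is one reason the squared version is so popular.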
Yeah ok, but how does that help us?
Well, if we know how much our model is off by, we can try and tweak those weights and bias values (the parameters in our model) so that we can reduce the amount of loss (or wrongness) that is in our model. Basically we want to close that gap between the predicted values from the model and the actual values. In other words, minimize the value returned by our loss function. Now, if we go back to high school calculus, remember when we learned how to calculate the derivative of a function to find the slope or rate of change of a function? If you don’t remember it, that’s fine, my 16 to 17 year old self is judging my current calculus skills really hard right now too and imagine being judged by your own teenage self!
Anyway, we want to use the derivative of our loss function to find the optimal values of the weights and bias parameters to tune our model to. The algorithms that use these derivatives to update the parameters are called optimizers (gradient descent is the classic one). To reach the optimal value, we want the derivative of our loss function to be 0, so the slope of the function needs to be flat. That’s when we know we’ve reached a local optimum of that function.
In the context of our baggage problem, let’s imagine our suitcase is the loss function and the scale is the derivative of the loss function. Now if we place our suitcase on the scales (i.e. take the derivative of the function), we may find that it is currently underweight (i.e. the derivative is negative), which means we can still add more items to our suitcase. In other words, we still have room to increase the weight of our suitcase until it reaches the optimal weight (the weight limit designated by that airline). Similarly, if our function is increasing, that is, the derivative of our loss function is positive, we will need to decrease the weight to get back to the optimal value, where the slope is flat and the loss is at its minimum.
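Here is a rough sketch of the suitcase-on-the-scales idea as gradient descent on a single number. It assumes a made-up loss, (packed weight - limit)², and a hypothetical 23 kg limit. Neither of those comes from a real airline or a real model; they are just there to show the direction of each nudge.

```python
def loss(packed_kg, limit_kg=23.0):
    """Made-up loss: how far the suitcase is from the (hypothetical) limit, squared."""
    return (packed_kg - limit_kg) ** 2

def loss_derivative(packed_kg, limit_kg=23.0):
    """Slope of the loss: negative when underweight, positive when overweight."""
    return 2.0 * (packed_kg - limit_kg)

def gradient_descent(start_kg, learning_rate=0.1, steps=100):
    """Repeatedly nudge the weight in the opposite direction of the slope."""
    packed_kg = start_kg
    for _ in range(steps):
        packed_kg -= learning_rate * loss_derivative(packed_kg)
    return packed_kg

final_kg = gradient_descent(start_kg=15.0)
print(round(final_kg, 3))  # settles at the limit, where the slope is flat
```

Starting underweight gives a negative slope, so subtracting it adds weight. Starting overweight gives a positive slope, so the same update removes weight. Either way the steps shrink as the slope flattens out.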
We typically use optimizers to learn the optimal values for both the weight and bias of our neurons and then we pass these values back to our model. This process is called back propagation because we start at the last layer and we decide what values to tune our weights and bias parameters to, then we move back a layer and do the same and then we just keep moving back through the layers to tune our parameters so that hopefully when we run through another iteration of our neural network, we are able to get a predicted value that is much closer to our actual value.
Why do we start at the back?
Have you ever had to repack your bags at the airport? Because I have and it is not fun. The best way is always to start by taking items off the top layer of your suitcase and try to squeeze them into your carry-on before you remove items from the next layer down, rather than emptying all the contents of your suitcase and then repacking everything from the start. By back propagating through your suitcase, you have a better idea of what items to remove. In the case of neural networks, back propagation helps you decide how much to tune your parameters: because each layer feeds into the layers after it, the error at the output has to be traced backwards, one layer at a time, to work out how much each earlier layer contributed. Sort of like how the earlier items you may have packed will dictate how much room you have left to pack some of your other remaining items.
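To see the back-to-front order in action, here is a tiny sketch with one hidden neuron and one output neuron. It assumes sigmoid activations, a squared-error loss and plain gradient descent as the optimizer. The starting weights, the learning rate and the single training example are all made up for illustration.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1 (written to avoid overflow)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def forward(x, params):
    """Forward propagation: input -> hidden neuron -> output neuron."""
    h = sigmoid(params["w1"] * x + params["b1"])
    y_hat = sigmoid(params["w2"] * h + params["b2"])
    return h, y_hat

def backward(x, y, h, y_hat, params):
    """Back propagation for the loss (y - y_hat)², using the chain rule."""
    # Start at the output layer...
    delta2 = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)
    grads = {"w2": delta2 * h, "b2": delta2}
    # ...then move back one layer, reusing what we just worked out.
    delta1 = delta2 * params["w2"] * h * (1.0 - h)
    grads["w1"] = delta1 * x
    grads["b1"] = delta1
    return grads

def train_step(x, y, params, learning_rate=0.5):
    """One iteration: forward pass, backward pass, then tweak every parameter."""
    h, y_hat = forward(x, params)
    grads = backward(x, y, h, y_hat, params)
    for name in params:
        params[name] -= learning_rate * grads[name]
    return (y - y_hat) ** 2  # loss measured before this step's update

# Made-up starting parameters and a single made-up training example.
params = {"w1": 0.5, "b1": 0.0, "w2": -0.3, "b2": 0.1}
x, y = 1.0, 1.0
losses = [train_step(x, y, params) for _ in range(200)]
print(losses[0], losses[-1])  # the loss shrinks as the parameters get tuned
```

The output layer’s `delta2` is computed first and then reused to get the hidden layer’s `delta1`, which is why it makes sense to start at the back.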
“Your plane is now boarding at gate…”
Anyway, I have to go. There are a lot of different types of neural networks and I’ve barely scratched the surface of artificial neural networks. Just know that deep learning is a heavily researched and advancing field in the space of AI, mostly because some of its applications have huge commercial benefits and successes. If you want to learn more about deep learning, go check out the Microsoft Learn platform. There is lots of free content that covers basically anything and everything and did I mention…it’s free?
Oh yeah and before I go, I should mention I also have a Twitter account that I don’t really use but I have one if you want to follow me (@mishxie). Maybe you can teach me a few tips and tricks on how to use Twitter properly.
As always, all feedback, whether positive, neutral or negative, is welcome 😊
Got to run now, take care!