Introduction to Neural Networks Basics
This is the first part of a series of blog posts on simple neural networks. The basics of neural networks can be found all over the internet; many of the articles cover the same ground, each written slightly differently.
Here we take a different approach: to build a deep understanding of neural networks, we explain each building-block concept needed to construct one.
We will narrow things down to the very basic concepts you need to build a neural network. The knowledge you gain in this article will help you understand the architecture of various deep learning models in the long run.
Many of us have been beaten by a pocket calculator in an arithmetic contest. Yet it will never improve its speed or accuracy, no matter how much it practices.
It doesn’t learn.
For example, every time I press its square-root button, it computes exactly the same function in exactly the same way. Here the pocket calculator is not learning.
But how can it learn?
By computing a function. Our brains can also learn much more efficiently based on the same idea. Before delving deeper into how such networks can learn, let’s first understand how they can compute.
In deep learning, this computing function is called a neural network model; in the broader machine learning literature, it is simply called a machine learning model.
Now let’s see how neural networks learn from the data we feed them.
Introduction to Neural Networks
A neural network is simply a group of interconnected neurons that are able to influence each other’s behavior.
Your brain contains about as many neurons as there are stars in our galaxy. On average, each of these neurons is connected to a thousand other neurons via junctions called synapses.
We can schematically draw a neural network as a collection of dots representing neurons, connected by lines representing synapses, as shown in the figure below.
Real-world neurons are very complicated. However, AI researchers have shown that neural networks can still attain human-level performance on many remarkably complex tasks, such as handwritten text recognition and identifying cancerous tumors, even if one ignores all these complexities and replaces real biological neurons with extremely simple simulated ones that are all identical and obey very simple rules.
Currently the most popular model for such an artificial neural network represents the state of each neuron by a single number and the strength of each synapse by a single number.
In this model, each neuron updates its state at regular time steps by simply averaging together the inputs from all connected neurons, weighting them by the synaptic strengths, optionally adding a constant, and then applying what’s called an activation function to the result to compute its next state.
The easiest way to use a neural network as a function is to make it feedforward, with information flowing only in one direction.
In case you like math, two popular choices for these activation functions are:
- Sigmoid function: ƒ(x) = 1 / (1 + e^(-x))
- Ramp function: ƒ(x) = max(0, x)
One famous model uses the step function
ƒ(x) = -1 if x < 0 and ƒ(x) = 1 if x >= 0.
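The three activation functions named above can be sketched in a few lines of Python. The function names `sigmoid`, `ramp` and `sign_step` are chosen here for illustration only, and the sigmoid is written in its simplest form, which can overflow for very large negative inputs.

```python
import math


def sigmoid(x: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def ramp(x: float) -> float:
    """Pass positive values through unchanged and clip negatives to 0."""
    return max(0.0, x)


def sign_step(x: float) -> float:
    """Return -1 for negative inputs and +1 otherwise."""
    return -1.0 if x < 0 else 1.0


for value in (-2.0, 0.0, 2.0):
    print(value, sigmoid(value), ramp(value), sign_step(value))
```

All three are nonlinear, which is what lets a network built from weighted sums compute more than a single weighted sum could.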
If the neuron states are stored in a vector, then the network is updated by simply multiplying that vector by a matrix storing the synaptic couplings and then applying the function ƒ to all elements.
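As a minimal sketch of this update rule, the snippet below uses NumPy and the step function ƒ from above. The three-neuron state and the coupling values are made up purely for illustration, and the helper name `update` is hypothetical.

```python
import numpy as np


def step(x):
    """Apply f(x) = -1 if x < 0, else 1, to every element."""
    return np.where(x < 0, -1.0, 1.0)


def update(state, couplings, activation=step):
    """One time step: multiply the state vector by the coupling matrix, then apply f."""
    return activation(couplings @ state)


# Illustrative values only: three neurons and a 3x3 matrix of synaptic couplings.
state = np.array([1.0, -1.0, 1.0])
couplings = np.array([
    [0.0, 1.0, -1.0],
    [1.0, 0.0, 1.0],
    [-1.0, 1.0, 0.0],
])

next_state = update(state, couplings)
print(next_state)
```

Row r of the matrix holds the strengths of the synapses feeding neuron r, so each element of the product is that neuron's weighted sum of inputs.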
Simple neural networks are universal in the sense that they can compute any function arbitrarily accurately by simply adjusting those synapse strength numbers accordingly.
When I first learned about neural networks, I was mystified by how something so simple could compute something arbitrarily complicated.
For example, how can you compute even something as simple as multiplication, when all you’re allowed to do is compute weighted sums and apply a single fixed function?
How this works is shown in the figure below, which shows how a mere four neurons can multiply two arbitrary numbers together, and how a single neuron can multiply three bits together.
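The three-bit case is easy to check by hand. The sketch below is one possible construction, not necessarily the one in the figure: it assumes a neuron with weights of 1 on each input, a constant of -2.5, and a 0/1 step activation. The weighted sum only exceeds zero when all three bits are 1, which is exactly when their product is 1.

```python
from itertools import product


def step(x: float) -> int:
    """0/1 step activation: fire only when the input is positive."""
    return 1 if x > 0 else 0


def multiply_three_bits(a: int, b: int, c: int) -> int:
    """Single neuron: weights (1, 1, 1), constant -2.5, step activation."""
    return step(a + b + c - 2.5)


# Compare the neuron against ordinary multiplication on all 8 inputs.
for a, b, c in product((0, 1), repeat=3):
    print(a, b, c, "->", multiply_three_bits(a, b, c), "expected", a * b * c)
```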
Now let’s see a hello world example of neural networks.
Suppose that we wish to classify megapixel grayscale images into two categories, say cats and dogs. If each of the million pixels can take one of, say, 256 values, then there are 256^1,000,000 possible images, and for each one we wish to compute the probability that it depicts a cat.
This means that an arbitrary function that inputs a picture and outputs a probability is defined by a list of 256^1,000,000 probabilities, i.e., way more numbers than there are atoms in our universe (about 10^78).
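To get a feel for the size of that number, here is a small sketch that counts its decimal digits using logarithms instead of building the number itself. The variable names are illustrative.

```python
import math

pixels = 1_000_000        # one megapixel
values_per_pixel = 256    # grayscale levels per pixel

# log10(256 ** 1_000_000) = 1_000_000 * log10(256)
log10_images = pixels * math.log10(values_per_pixel)
digits = math.floor(log10_images) + 1

print(f"256^{pixels} has {digits} decimal digits")
```

The count of possible images is a number with roughly 2.4 million digits, whereas a number like 10^78 has only 79.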
Now we have an idea of how neural networks work. To put it simply:
“Fire together, wire together”
Let’s see the math behind the neural networks.
The math behind the neural networks
At each node in the hidden and output layers of the neural network (NN), an activation function is executed.
The activation function can also be called a transfer function. This function takes in the output of the previous node, multiplied by some weight. The weights that come out of one node can all be different, so the same output can activate different neurons to different degrees.
There are many forms of transfer function; we will first look at the sigmoid transfer function, as it is the traditional choice.
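Here we will refer to the following indices: i labels a node in the input layer I, j a node in the hidden layer J, and k a node in the output layer K.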
The activation function at a node j in the hidden layer takes the value:
At each hidden layer node, multiply each input value by the connection received by the node and add them together.
In our case, this is the final output layer. So for each of the k nodes in K:
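The forward pass described above can be written out as follows. This is a sketch assuming the sigmoid transfer function σ; the symbols ξ_i (input feature), w_ij and w_jk (weights), θ_j and θ_k (bias terms), x (total weighted input to a node) and O (node output) are notation assumed here for illustration.

```latex
\begin{aligned}
\sigma(x) &= \frac{1}{1 + e^{-x}} \\
O_j &= \sigma(x_j), \qquad x_j = \sum_{i \in I} \xi_i \, w_{ij} + \theta_j \\
O_k &= \sigma(x_k), \qquad x_k = \sum_{j \in J} O_j \, w_{jk} + \theta_k
\end{aligned}
```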
As this is the training phase of our network, the true results will be known when we calculate the error.
What is Error
We measure the error at the end of each forward pass. This allows us to quantify how well our network has performed in getting the correct output. Once the neural network has been built, we can also use various evaluation metrics to measure the performance of the model.
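As a rough sketch of a forward pass followed by an error measurement, the snippet below assumes a sigmoid network with one hidden layer and a squared-error loss, E = 1/2 Σ (O_k - t_k)^2. Both the choice of loss and the names `forward`, `squared_error`, `w_ij`, `w_jk`, `theta_j` and `theta_k` are assumptions made for illustration. All weights are set to zero so the result is easy to check by hand.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def forward(inputs, w_ij, theta_j, w_jk, theta_k):
    """Feed the inputs through the hidden layer J and the output layer K."""
    hidden_out = sigmoid(inputs @ w_ij + theta_j)   # one value per hidden node
    output = sigmoid(hidden_out @ w_jk + theta_k)   # one value per output node
    return hidden_out, output


def squared_error(output, target):
    """Assumed loss: half the sum of squared differences."""
    return 0.5 * np.sum((output - target) ** 2)


# Illustrative network: 2 inputs, 3 hidden nodes, 1 output, all weights zero.
inputs = np.array([0.5, -1.0])
w_ij, theta_j = np.zeros((2, 3)), np.zeros(3)
w_jk, theta_k = np.zeros((3, 1)), np.zeros(1)
target = np.array([1.0])

_, output = forward(inputs, w_ij, theta_j, w_jk, theta_k)
error = squared_error(output, target)
print(output, error)
```

With zero weights every sigmoid receives 0 and returns 0.5, so the error is 0.5 * (0.5 - 1)^2 = 0.125.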
Good! Now how does this help us?
Our aim here is to find a way to tune our network such that when we do a forward pass of the input data, the output is exactly what we know it should be. But we can’t change the input data, so there are only two things we can change:
- The weights going into the activation function.
- The activation function itself.
The second case will be covered in a separate blog post, since there are a lot of activation functions, but the magic of neural networks is all about the weights.
Getting each weight, i.e. each connection between nodes, to be just the perfect value is what backpropagation is all about. We’ll look at the backpropagation algorithm in the next section.
But let’s go ahead and set it up by considering the following:
How much of this error E has come from each of the weights in the network?
The derivative of the error function with respect to the weights is then:
We group the terms involving k and define:
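As a sketch of these two expressions, assume the squared error E = 1/2 Σ_k (O_k - t_k)^2 with target values t_k, and sigmoid outputs O_k = σ(x_k) with x_k = Σ_j O_j w_jk + θ_k. The choice of loss and these symbols are assumptions made here for illustration. Using the sigmoid identity σ'(x) = σ(x)(1 - σ(x)), only the k-th term of the sum depends on w_jk, which gives:

```latex
\begin{aligned}
\frac{\partial E}{\partial w_{jk}}
  &= (O_k - t_k)\,\frac{\partial O_k}{\partial w_{jk}}
   = (O_k - t_k)\, O_k (1 - O_k)\, O_j \\
\delta_k &\equiv O_k (1 - O_k)(O_k - t_k) \\
\frac{\partial E}{\partial w_{jk}} &= O_j \, \delta_k
\end{aligned}
```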
When calculating the error, special care needs to be taken over the form of the loss function, as neural networks tend to overfit the data if the data provided is not diverse enough.
Even though there are various ways to create more diverse data from the data available, it’s still worth keeping this in mind.
How Back Propagation Works
Backpropagation takes the error function and uses it to calculate the error on the current layer and updates the weights to that layer by some amount.
So far we’ve looked at the error on the output layer. What about the hidden layer?
Now, unlike before, we cannot just drop the summation as the derivative is not directly acting on a subscript k in the summation. We should be careful to note that the output from every node in J is actually connected to each of the nodes in K so the summation should stay.
But we can still use the same tricks as before: let’s use the power rule again and move the derivative inside (because the summation is finite):
Let’s use the chain rule to break apart this derivative in terms of the output from J:
Therefore this derivative just becomes the weights w_jk. The final derivative has nothing to do with the subscript k anymore, so we’re free to move it around. Let’s put it at the beginning:
Almost there! Recall that we defined δ_k earlier; let’s substitute that in:
To clean this up, we now define the ‘delta’ for our hidden layer:
That’s the amount of error on each of the weights going into our hidden layer:
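The chain of steps above can be sketched in one derivation. It assumes the same setup as for the output layer: squared error E = 1/2 Σ_k (O_k - t_k)^2, sigmoid outputs, x_k = Σ_j O_j w_jk + θ_k, x_j = Σ_i ξ_i w_ij + θ_j, and the output-layer delta δ_k = O_k (1 - O_k)(O_k - t_k). These symbols are illustrative assumptions.

```latex
\begin{aligned}
\frac{\partial E}{\partial w_{ij}}
  &= \frac{\partial}{\partial w_{ij}} \, \frac{1}{2} \sum_{k \in K} (O_k - t_k)^2
   = \sum_{k \in K} (O_k - t_k)\, \frac{\partial O_k}{\partial w_{ij}} \\
  &= \sum_{k \in K} (O_k - t_k)\, O_k (1 - O_k)\,
     \frac{\partial x_k}{\partial O_j}\, \frac{\partial O_j}{\partial w_{ij}} \\
  &= \sum_{k \in K} (O_k - t_k)\, O_k (1 - O_k)\, w_{jk}\,
     \frac{\partial O_j}{\partial w_{ij}} \\
  &= \frac{\partial O_j}{\partial w_{ij}} \sum_{k \in K} \delta_k \, w_{jk}
   = O_j (1 - O_j)\, \xi_i \sum_{k \in K} \delta_k \, w_{jk} \\
\delta_j &\equiv O_j (1 - O_j) \sum_{k \in K} \delta_k \, w_{jk} \\
\frac{\partial E}{\partial w_{ij}} &= \xi_i \, \delta_j
\end{aligned}
```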
What is Bias
Let’s remind ourselves what happened inside our hidden layer nodes:
- Each feature ξ_i from the input layer I is multiplied by some weight w_ij.
- These are added together to get x_j, the total weighted input from the nodes in I.
- x_j is passed through the activation or transfer function, σ.
- This gives the output O_j for each of the j nodes in hidden layer J.
- O_j from each of the J nodes becomes the input for the next layer.
When we talk about the bias term in neural networks, we are actually talking about an additional parameter that is included in the summation of step 2 above.
The bias term is usually denoted with the symbol θ (theta). Its function is to act as a threshold for the activation (transfer) function.
The bias node is given the value of 1 and receives no input from anything else. This means that the derivative of a node’s weighted input with respect to its bias term is just a constant, 1.
This allows us to just think of the bias term as an output from the node with the value of 1. This will be updated later during back propagation to change the threshold at which the node fires.
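One way to see this in code is to treat the bias as a weight attached to an extra input that is always 1. The sketch below uses made-up numbers and illustrative variable names, and checks that the two formulations give the same node output.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Illustrative values for a single node with two inputs.
inputs = np.array([0.2, 0.7])
weights = np.array([0.5, -0.3])
theta = 0.1  # bias term

# Formulation 1: add the bias explicitly to the weighted sum.
explicit = sigmoid(inputs @ weights + theta)

# Formulation 2: append a constant input of 1 whose weight is the bias.
augmented_inputs = np.append(inputs, 1.0)
augmented_weights = np.append(weights, theta)
folded = sigmoid(augmented_inputs @ augmented_weights)

print(explicit, folded)
```

Because the extra input is always 1, the bias can be updated during backpropagation with the same rule as any other weight.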
Now we have all the pieces needed to understand neural networks. Note that the bias we are talking about here is completely different from the bias in the bias-variance tradeoff in machine learning.
We’ve got the initial outputs after our feed-forward, we have the equations for the delta terms (the amount by which the error is based on the different weights) and we know we need to update our bias term too.
So what does it look like:
1. Input the data into the network and feed-forward.
2. For each of the output nodes calculate:
3. For each of the hidden layer nodes calculate:
4. Calculate the changes that need to be made to the weights and bias terms:
5. Update the weights and biases across the network:
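The quantities in steps 2 to 5 can be sketched as follows, assuming sigmoid nodes, a squared-error loss with targets t_k, and a learning rate η that sets the step size. The learning rate and the symbols are assumptions made here for illustration.

```latex
\begin{aligned}
\text{Step 2:}\quad & \delta_k = O_k (1 - O_k)(O_k - t_k) \\
\text{Step 3:}\quad & \delta_j = O_j (1 - O_j) \sum_{k \in K} \delta_k \, w_{jk} \\
\text{Step 4:}\quad & \Delta w_{jk} = -\eta \, \delta_k \, O_j, \qquad
                      \Delta w_{ij} = -\eta \, \delta_j \, \xi_i, \qquad
                      \Delta \theta_k = -\eta \, \delta_k, \qquad
                      \Delta \theta_j = -\eta \, \delta_j \\
\text{Step 5:}\quad & w \leftarrow w + \Delta w, \qquad \theta \leftarrow \theta + \Delta \theta
\end{aligned}
```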
This algorithm is looped over and over until the error between the output and the target values is below some set threshold. Depending on the size of the network i.e. the number of layers and number of nodes per layer, it can take a long time to complete one ‘epoch’ or run through of this algorithm.
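Putting the five steps into a loop might look like the sketch below. It is a minimal illustration rather than a definitive implementation: it assumes sigmoid nodes, a squared-error loss, full-batch updates, and a toy dataset (the logical OR of two bits). The function name `train` and the parameters `eta`, `threshold`, `max_epochs` and `seed` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train(inputs, targets, n_hidden=2, eta=0.5, threshold=0.01,
          max_epochs=10_000, seed=0):
    """Repeat the five steps until the error drops below the threshold."""
    rng = np.random.default_rng(seed)
    w_ij = rng.normal(size=(inputs.shape[1], n_hidden))
    theta_j = np.zeros(n_hidden)
    w_jk = rng.normal(size=(n_hidden, targets.shape[1]))
    theta_k = np.zeros(targets.shape[1])

    errors = []
    for _ in range(max_epochs):
        # Step 1: feed-forward.
        hidden = sigmoid(inputs @ w_ij + theta_j)
        output = sigmoid(hidden @ w_jk + theta_k)
        errors.append(0.5 * np.sum((output - targets) ** 2))
        if errors[-1] < threshold:
            break

        # Step 2: delta for each output node.
        delta_k = output * (1 - output) * (output - targets)
        # Step 3: delta for each hidden node (uses the weights before updating).
        delta_j = hidden * (1 - hidden) * (delta_k @ w_jk.T)

        # Steps 4 and 5: compute the changes and apply them, summed over samples.
        w_jk -= eta * hidden.T @ delta_k
        theta_k -= eta * delta_k.sum(axis=0)
        w_ij -= eta * inputs.T @ delta_j
        theta_j -= eta * delta_j.sum(axis=0)
    return errors


# Toy dataset: the OR of two bits.
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([[0], [1], [1], [1]], dtype=float)

errors = train(inputs, targets)
print(f"epochs run: {len(errors)}, first error: {errors[0]:.4f}, last error: {errors[-1]:.4f}")
```

The `max_epochs` cap is a practical safeguard so the loop still stops if the error never reaches the threshold.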
In the next article, we’ll discuss different types of activation functions. If you have FOMO (“fear of missing out”), please follow us.
If you like the article, share it; if not, tell us. Be like a neural network: learn from mistakes.