Convolutional Neural Network for dummies

Prasad
8 min read · Nov 28, 2017


Computers capable of thinking for themselves have been a topic of discussion for a long time. The hype and hyperbole around the idea made it so easy to get carried away that many of the older generation brushed these advancements aside as scenes from a science-fiction movie. Not anymore: there is enough evidence that some of that fiction may eventually come true. Today you can deposit a check at a machine that reads the amount on the check and credits it to your account. Amazon can suggest the items you are most likely to need based on what you added to your shopping cart. Your email system can recognize which messages matter to you and categorize them by their content. Netflix can help you choose your next movie based on the movies you have already watched and how you rated them. There are systems today that, given the evidence, can predict the legal outcome of a case. We can even convert black-and-white photos to color: what a system learns from pairs of black-and-white and color photos can be applied to photos for which only a black-and-white version exists. And we have only scratched the surface of the possibilities; there is more to come.

Although computers amazed humans with their ability to crunch numbers fast, they always lacked human-like thinking; a computer was nothing more than a machine that constantly needed some kind of help. In the last decade, however, a lot of advancements have been made in Artificial Intelligence (AI), especially in its sub-domains of machine learning and deep learning. This is the technology behind all the experiences I described in the last paragraph. In machine learning, we feed data to a system and use powerful algorithms to build a model of that data based on the correlations it finds; given a set of input parameters, the system can then predict a dependent variable of that data. If the machine gets something wrong and you can train it again on what went wrong, so that next time it predicts correctly when it sees something similar, the system is learning; deep learning does this with many stacked layers of artificial neurons. Engineers, especially in the technology domain, are already spending their time designing systems that can ask the right questions and tackle intractable problems that traditional computers cannot solve.

Since I intend this article to give a layperson an idea of how all these things work under the hood, I will start with a simple example. Take a simple challenge: given an image containing numbers, your computer needs to make its best guess about which digit is in the image. If you can solve this problem and scale it to much larger real-life problems, you can identify what is written on a signboard, tell what language it is in, and even work out what it means in context. If you have an autonomous car, it had better be able to understand signboards, hadn't it? :D

Take the picture below as an example. It is a pixel-by-pixel depiction of how the digits in an image really appear as we zoom in. In the picture below we see the number 32 represented in pixels.

Note that there are 64 (8x8) pixels in each smaller square, and a total of six such squares form the wire mesh above. Now, how can we detect the digits in these pixels programmatically? One way is to use filters. The idea is to slide 8x8 filters (something like the one shown below) over the six boxes above. For example, here is a filter that can be used to detect the digit one (1):

Similarly, for the digit three (3), you can have something like this:

So when you slide these filters over the mesh and count the number of black squares as the 8x8 filter moves through the six boxes, a count of 64 dark squares means the digit is totally obscured: the full 8x8 box appears dark, and the digit in the filter matches the digit in the larger grid. Each digit is drawn so that no filter produces exactly 64 black squares unless it matches the digit underneath. That way, you know exactly which digit is represented. This is essentially how our brain detects objects too, but the brain does it far more efficiently and far faster. Here we have ten filters, one for each digit; with our eyes it is effectively millions of filters applied to what we see, in a fraction of a second, followed by an analysis of the results to determine what we are looking at.
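The matching trick above can be sketched in a few lines of Python. The tiny 8x8 patterns here are invented for illustration: the filter is the photographic negative of its digit, so overlaying it on a matching cell darkens all 64 squares.

```python
# Toy digit matching: each cell of the mesh and each filter is an 8x8
# grid of 0s and 1s (1 = dark pixel). The filter is the negative of its
# digit, so overlaying it on a matching cell darkens all 64 squares.

def count_dark_after_overlay(cell, filt):
    """Overlay the filter on a cell; a square is dark if either layer is dark."""
    return sum(c | f for row_c, row_f in zip(cell, filt)
                     for c, f in zip(row_c, row_f))

def matches(cell, filt):
    return count_dark_after_overlay(cell, filt) == 64

# A crude 8x8 "one": a vertical stroke in column 4 (made up for illustration).
one = [[1 if col == 4 else 0 for col in range(8)] for _ in range(8)]
# Its filter is the negative: dark everywhere the digit is light.
one_filter = [[1 - px for px in row] for row in one]

print(matches(one, one_filter))  # True: all 64 squares go dark
```

A cell with the stroke in a different column would leave some squares light, so no other digit reaches the count of 64.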

Convolutional Neural Networks (CNN)

Machine learning mimics how the cells (neurons) in the human brain behave. The problem discussed above is closely analogous to the way our brain detects objects through our eyes. It is an example of a convolutional neural network (CNN), though a very simple one; in reality, the problem space is much more complex than this. But the model can be scaled up to solve real-world problems, like those faced by an autonomous car trying to understand its surroundings. Real-life problems need powerful computers to do all these calculations; depending on the complexity of the problem, training can involve data centers running for weeks, and filters like the ones above may have to be applied repeatedly (in epochs) to solve it.

The strides the square makes over the bigger wireframe are the convolution. The filter in this case takes a stride of 8 pixels (the width of each square) as it proceeds with its scanning. At each stride it counts the number of black squares in that cell; if the count is 64, that cell shows the digit the filter represents, and otherwise that digit can be ruled out for the cell. This is a convolutional neural network at its simplest.
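This scanning loop can be sketched as follows, assuming the same invented 8x8 patterns as before (the filter is the negative of its digit, so 64 dark squares after overlay means a match):

```python
# Slide an 8x8 filter across an 8-row x 48-column mesh in strides of 8,
# returning the dark-square count at each stop.

def scan(mesh, filt, stride=8):
    counts = []
    for x in range(0, len(mesh[0]) - 8 + 1, stride):
        cell = [row[x:x + 8] for row in mesh]            # cut out one 8x8 cell
        dark = sum(c | f for rc, rf in zip(cell, filt)
                         for c, f in zip(rc, rf))        # dark squares after overlay
        counts.append(dark)                              # 64 here means a match
    return counts

# A crude "one" (vertical stroke in column 4) and its negative as a filter.
one = [[1 if col == 4 else 0 for col in range(8)] for _ in range(8)]
one_filter = [[1 - px for px in row] for row in one]

# A mesh of six 8x8 cells: a "one" in the first cell, the rest left blank.
mesh = [one[r] + [0] * 40 for r in range(8)]

print(scan(mesh, one_filter))  # [64, 56, 56, 56, 56, 56]
```

Only the first stop reaches 64, so the filter reports a "one" in the first cell and rules it out everywhere else.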

Now, let's analyze how the neurons in our brain work: we will look at the structure of a neuron and map the problem we just solved onto the way a neuron would handle it. As we go along, we will relate it to some machine-learning jargon too.

In the picture above you can see neurons. One neuron connects to the next through a junction called a synapse. Signals travel along a neuron's axon, pass through the synapse at its end, and are transmitted to a dendrite of the next neuron. Together they form a network of neurons. Below is a pictorial representation of connected neurons.

A neuron classifies the data arriving through its many dendrites based on some logic (shaped by its previous learning) and then transmits the result to the neighboring neuron. The logic we applied to the wireframe is analogous to this: a filter takes some data, transforms it into more meaningful content, and passes it on to the next filter. One filter can tell whether the mesh shows the digit it expects, or clearly rule that digit out; with each layer you gain a deeper understanding of the number the wireframe represents. That may seem simple, but there are about 100 billion neurons in a human brain, each as small as 4 microns, so you can imagine how complex the brain's processing is.
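The behavior just described can be sketched as a single artificial neuron: weighted inputs, a bias, and a threshold. The numbers below are arbitrary examples, not anything measured from a real brain.

```python
# A single artificial neuron: signals arrive on several "dendrites"
# (inputs), each connection has a weight, and the neuron fires if the
# weighted sum clears a threshold.

def neuron(inputs, weights, bias):
    # Weighted sum of incoming signals, like charge building in the cell body.
    activation = sum(i * w for i, w in zip(inputs, weights)) + bias
    # Step activation: fire (1) or stay silent (0).
    return 1 if activation > 0 else 0

print(neuron([1.0, 0.5, 0.2], [0.4, -0.1, 0.9], bias=-0.3))  # 1 (fires)
```

Stack many of these in layers, feed the outputs of one layer into the next, and you have the skeleton of a neural network.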

Machine Learning

Now, let's bring some machine-learning terminology into the scenarios we discussed. The scenario above may be good enough for reading a number off a wireframe, but understanding 2D images requires more complex models, and for 3D data the complexity increases further: the number of layers where filters are applied grows dramatically. The model we adopted may be well suited to identifying written digits, but if the input is speech and you want to identify which number was pronounced, that requires a totally different model, even though the underlying principle of layered classification leading to a conclusion about the data is the same.

One of the widely used models for object detection is AlexNet. This is a network like the one above, but for determining what is in a 2D image. The network layers are as below.

As you can see, this has many more layers (filters) applied. It achieves over 80% accuracy in determining what exactly is in a picture, and it was the winning model of the ImageNet competition, in which over a million images are classified into 1,000 categories. Many similar models, such as ResNet, GoogLeNet, and LeNet (famously trained on the MNIST digit dataset), have been developed for different applications.
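To get a feel for how many layers are involved, here is a rough bookkeeping sketch of how AlexNet's convolutional stack shrinks a 227x227 input, using the standard output-size formula; the kernel, stride, and padding values below follow the commonly cited AlexNet configuration.

```python
# Standard formula for the output size of a convolution or pooling layer.
def out_size(size, kernel, stride=1, pad=0):
    return (size + 2 * pad - kernel) // stride + 1

size = 227                                   # input image width/height
size = out_size(size, kernel=11, stride=4)   # conv1    -> 55
size = out_size(size, kernel=3, stride=2)    # max pool -> 27
size = out_size(size, kernel=5, pad=2)       # conv2    -> 27
size = out_size(size, kernel=3, stride=2)    # max pool -> 13
size = out_size(size, kernel=3, pad=1)       # conv3    -> 13
size = out_size(size, kernel=3, pad=1)       # conv4    -> 13
size = out_size(size, kernel=3, pad=1)       # conv5    -> 13
size = out_size(size, kernel=3, stride=2)    # max pool -> 6
print(size)  # 6: the 6x6 feature maps are fed to the fully connected layers
```

Five convolutions and three poolings turn a 227-pixel-wide image into a stack of 6x6 feature maps, which the final fully connected layers then turn into a prediction over the 1,000 categories.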

Note that in the example above there is no learning involved; it is about extracting features from the available data (feature extraction, in machine-learning parlance). That brings us to the second part, learning, which is done through backpropagation.

Learning: In the scenario above we used filters that we took from somewhere. Imagine a scenario where you need to build your own filters. When a baby is born, his or her brain is a clean slate; he or she has to generate such filters. How does that happen? How does previous learning help generate such filter layers in your brain? See the cat experiments by Hubel and Wiesel:

https://youtu.be/4nwpU7GFYe8

https://youtu.be/IOHayh06LJ4

Interesting, right? What happens is that your previous learning helps define the filters your brain uses the next time such a scenario arises. Each filter is given a weight: each time applying a filter leads to a right or wrong conclusion, its weight in the decision is adjusted accordingly. In machine learning this is called backpropagation, the method used to adjust the weights of the nodes (in our example, the filters) in the models we use. This is the learning part.
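The weight-adjustment idea can be sketched with gradient descent on a single weight; real backpropagation chains this same update rule backward through every layer of a network. The numbers here are arbitrary.

```python
# One weight, one training example, gradient descent on squared error.
w = 0.0                  # initial weight: the "clean slate"
x, target = 2.0, 1.0     # input and the right answer for it
lr = 0.1                 # learning rate: how big each correction is

for _ in range(50):
    prediction = w * x
    error = prediction - target
    # The gradient of the squared error (error**2) with respect to w
    # is 2 * error * x; step the weight against the gradient.
    w -= lr * 2 * error * x

print(round(w, 3))  # 0.5, since 0.5 * 2.0 hits the target of 1.0
```

Each pass nudges the weight toward a value that makes the prediction match the target; a full network does the same thing for millions of weights at once.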

I have not explained the statistical and mathematical principles behind implementing all this. There are many software frameworks, such as TensorFlow and PyTorch, created to deal with it all.

Happy Machine Learning! :)
