When to prefer Deep Learning over classical Machine Learning methods

Semih Akbayrak
4 min read · Mar 20, 2017

When I first heard the term deep learning, I thought it was nothing but a feed-forward neural network with more than one hidden layer. But those were the days when I was new to the machine learning field, and machine learning was just classification, regression, or clustering to me. Back then I also interpreted ANNs simply as a machine learning method for handling problems that linear models cannot solve, so it was not clear to me why Deep Learning was so magical. Over time, I learned about different neural network types like Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Autoencoders, Deep Generative Models, etc., and the reasons why Deep Learning is so popular gradually became clearer to me.

Deep Learning has a fundamental advantage over classical Machine Learning methods: its power to perceive the world somewhat like a human being does. It may sound a little bit like science fiction :) but this is my interpretation. To make it clearer, let me give you an example from the Computer Vision field. Let's assume the task is to count the chocolate packets on a table, and there are packaged biscuits on the table as well. So we first have to detect the packets, and then distinguish the chocolate packets from the biscuit packets. If we want to solve the problem with classical machine learning methods, we first have to take preprocessing steps like segmentation, which is itself another machine learning task. With segmentation, we detect the rectangles on the table, then feed the features of those rectangles into a classifier to determine which ones are chocolate. So before using a classifier, you have to extract the features yourself, and this process is called feature engineering. If you want reliable results, you have to be careful in feature engineering so that you actually capture the informative features in the data.

On the other hand, you can use CNNs to solve this problem without any feature engineering. A CNN doesn't need it because of its human-like model of perception. A CNN perceives a photo starting from the raw pixels. The first hidden layers detect edges; edges combine into corners, corners and edges combine into rectangles and other shapes, and these shapes build up the object itself. The information about the photo flows from layer to layer, and this is why we call it Deep Learning. So a CNN detects the packets on the table without the help of a separate segmentation step, and it makes the final inference by itself.
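To make the "no feature engineering" point concrete, here is a rough sketch of such a CNN in PyTorch. Everything in it is my own illustration rather than a real detection pipeline: the class name PacketClassifier, the 64x64 input crops, the two-class chocolate/biscuit setup, and the layer widths are all assumptions for the example.

```python
import torch
import torch.nn as nn

class PacketClassifier(nn.Module):
    """Illustrative CNN: raw pixel crops in, chocolate-vs-biscuit scores out."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level filters: edges
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # corners and simple shapes
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # object parts
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(64 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)           # pixels -> learned feature maps, layer by layer
        x = x.flatten(start_dim=1)     # no hand-crafted features anywhere
        return self.classifier(x)

model = PacketClassifier()
dummy_batch = torch.randn(4, 3, 64, 64)   # four fake 64x64 RGB crops
print(model(dummy_batch).shape)           # torch.Size([4, 2])
```

Notice that the features fed to the final classifier are learned by the convolutional layers themselves; the only "engineering" is choosing the architecture.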

The first example was about Computer Vision. To diversify the examples, let me give a Natural Language Processing one. Words carry semantic meanings for us. Their meanings change from context to context, but even so, a word has a general meaning. When we hear the word Paris, the word France immediately comes to mind, because our minds are biased to associate Paris with France rather than with the celebrated Paris Hilton. Even Ankara comes to mind before Paris Hilton does; after all, both Paris and Ankara are capitals. So a question appears here: how can we represent these semantic meanings of words? About four years ago, Mikolov used a simple autoencoder-like neural network to find word vectors. What he did is basically use a word as input and the adjacent words as output, or vice versa. During training, the network gradually learns to reconstruct the words adjacent to the input word, and the innermost hidden layer's value becomes the vector representation of the input word. After training is complete, every word owns a vector which is indeed semantically meaningful: if you analyze the vectors, you will see that the vectors of semantically close words lie close to each other in the space, so the inner product of the vectors of two similar words is high. This method is called word2vec, and it has some really interesting properties; for example, a good training run can produce something close to vec[Ankara] when you compute vec[Paris] - vec[France] + vec[Turkey].
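Here is a rough sketch of how you could train such word vectors with the gensim library (assuming gensim 4.x; the tiny toy corpus and the parameters below are just illustrative, and real word vectors need far more text than this).

```python
from gensim.models import Word2Vec

# A made-up toy corpus; in practice you would train on millions of sentences.
corpus = [
    ["paris", "is", "the", "capital", "of", "france"],
    ["ankara", "is", "the", "capital", "of", "turkey"],
    ["paris", "hilton", "is", "a", "celebrity"],
]

# sg=1 selects the skip-gram variant: predict the adjacent words from the input word.
# vector_size is the embedding dimension (older gensim versions call it `size`).
model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                 min_count=1, sg=1, seed=1)

# With enough training data, semantically close words end up close in the vector space,
# and vec[Paris] - vec[France] + vec[Turkey] lands near vec[Ankara]:
print(model.wv.most_similar(positive=["paris", "turkey"],
                            negative=["france"], topn=3))
```

The most_similar call is exactly the vec[Paris] - vec[France] + vec[Turkey] arithmetic from the paragraph above, followed by a nearest-neighbour search over all word vectors.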

These two examples can look like magic to people who are not familiar with machine learning, but they make sense to us. Intuitively you expect to see these results, because you are the one who designs the operations in deep learning; yet I believe it is exactly this reliance on intuition that makes deep learning models weak. It is a fact that deep learning is very powerful at perceiving our world, and it is also powerful at the decision stage, but it is weak in the sense that it rests on a lot of intuition rather than on statistical foundations.

Now it is an important challenge for machine learning scientists to combine statistical foundations with deep learning models. Some of the important professors of Bayesian Machine Learning and their students have already started to publish revolutionary papers in this field. For people who find these topics interesting, I can suggest a few names: Max Welling and Durk Kingma, Zoubin Ghahramani and Yarin Gal, Shakir Mohamed and Danilo Rezende.
