Artificial Intelligence (AI) is a broad and complicated branch of computer science. One of its subfields is Machine Learning, which concerns itself with... you guessed it... learning. At Translucent Computing, we are currently developing applications that use machine learning algorithms. Machine learning is not a new field of study, but with the exponential growth in computer processing power we can now build systems that are capable of learning and processing data in real time. One popular example of machine learning is object recognition: the task of finding, identifying, and processing objects in an image or a video. Object recognition can be found in a lot of places these days, from robotic control systems to face detection on your mobile phone. To show a full use case from bottom to top, I will demonstrate machine learning performing object recognition on human handwriting. We will teach the computer to recognize hand-written numbers.

We started the machine learning process by creating a training data set that can be processed by a computer. Like teaching a child, the best way to teach a computer to recognize an object is to show it an image of the object. Since we are teaching the computer to recognize hand-written numbers, we started by generating a data set in our drawing application, Cloud Doodle. Here is a sample:

We used the drawing app to draw multiple versions of each number from 0 to 9. We then created an individual file for each drawn number so that it would be easier to submit the images to the learning algorithm.

After creating the initial data set, we began to build the learning system. The best way to prototype a learning system is to use Matlab or Octave. Both math programs have the necessary tools for developing a learning system quickly and efficiently. The features of both programs that are used most when developing machine learning algorithms are the support for linear algebra with a matrix-based syntax and the statistical libraries. We used Octave for this blog since it is open source software and available for all major operating systems.

There are multiple types of machine learning algorithms. The two major ones are *supervised learning* and *unsupervised learning*. Supervised learning algorithms are trained on labeled data, and unsupervised learning algorithms operate on unlabeled data. Labeled data refers to data where we know the input and the corresponding output. Our machine learning algorithm is a supervised learning algorithm, since we have the input and we know what the output should be. If we pass an image of the number 9 into the algorithm, the output should be the number 9. Through this supervised learning method we can teach the computer to recognize hand-written numbers. It learns from every image that is passed into the algorithm, and with each new input its understanding of the numbers improves. There is a point, however, where more data will not make much of a difference in the accuracy of recognizing the correct number. To simplify this blog we will defer the discussion of bias and variance to another time. There are several ways to implement a supervised learning algorithm; in this blog we will limit the discussion to neural networks.
Artificial neural networks (NN) are inspired by the human brain and try to mimic its interconnected neurons. There are many different types of NN; here we will concentrate only on the most common one: a *fully connected feed forward multilayer perceptron network*. Here is an example of such an NN. It has the same kind of architecture that we used for our NN.
As you can see it gets complicated fairly quickly, so I’ve only created a diagram of a network with 3 inputs and a bias input, a hidden layer with 3 neurons and a bias node, and one neuron in the output layer. The network in the diagram has a binary output, which is either true or false. The architecture used in our NN consists of 1601 inputs, 101 neurons in the hidden layer and 10 output neurons. The input layer in our NN is made up of all the pixels in the 40x40 pixel image of the hand-written number (1600 pixels), plus the bias input. The 10 outputs are for the numbers 0 to 9. There are no hard rules when it comes to deciding on the NN architecture; you want to come up with the most efficient network you can. We ran the training set with multiple configurations, and with this particular training data set we found that the best accuracy occurred when we set the number of hidden neurons to 100 (101 when the bias node is counted). Also, for this blog we chose to use only one hidden layer, both to simplify the calculations and because there was no significant gain in accuracy when we tested the network with multiple hidden layers.
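The layer sizes above determine how many weights the network has to learn. Here is a minimal Python sketch of that arithmetic. The variable names are made up for illustration, and it assumes that a bias node only feeds forward and has no incoming connections of its own, which the text does not state explicitly.

```python
# Layer sizes taken from the text: a 40x40 pixel image, 100 hidden neurons, 10 outputs.
IMAGE_SIDE = 40
pixels = IMAGE_SIDE * IMAGE_SIDE      # 1600 pixel inputs
hidden_neurons = 100                  # hidden neurons, not counting the bias node
outputs = 10                          # one output per digit 0-9

input_nodes = pixels + 1              # 1601 with the bias input
hidden_nodes = hidden_neurons + 1     # 101 with the hidden bias node

# Assumption: bias nodes receive no inputs, so each weight matrix has one row
# per receiving neuron and one column per sending node (bias included).
theta1_shape = (hidden_neurons, input_nodes)   # input layer -> hidden layer
theta2_shape = (outputs, hidden_nodes)         # hidden layer -> output layer

total_weights = (theta1_shape[0] * theta1_shape[1]
                 + theta2_shape[0] * theta2_shape[1])

print(input_nodes, hidden_nodes, theta1_shape, theta2_shape, total_weights)
```

Under these assumptions, even this small network has well over a hundred thousand weights to adjust, which is why a matrix-based tool is convenient for prototyping.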

Each connection between two nodes in the diagram has a weight associated with it. The training process adjusts the weights, and through this adjustment the network learns. To reduce the clutter in the above diagram I’ve omitted the weights that we determine during the learning process. I’m going to use Θ (theta) to represent the weight between two connected nodes. To give an example of how to process such a network I’m going to start with node $a_1$ in the hidden layer from the diagram above. Since this is a fully connected network, all the inputs feed into node $a_1$. In our NN, all the pixels from the image feed into that node. The extra input node $x_0$ is called the bias node, and for now we will ignore it. Together with the weights, here is the formula for the node:

$$a_1 = g(\Theta_{10}x_0 + \Theta_{11}x_1 + \Theta_{12}x_2 + \Theta_{13}x_3)$$

This mathematical representation is the same as the visual representation above, which could be described thusly:

1. All the inputs feed into the node.

2. The inputs are adjusted by the weight and are processed.

3. The output from the node is fed into the input of the next node.
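The three steps above can be sketched in a few lines of Python. This is only an illustration: the input and weight values are invented, g() is taken to be the logistic function that the post introduces next, and the bias input is assumed to be fixed at 1, a common convention that the text does not spell out.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def node_output(weights, inputs):
    """Weighted sum of all inputs (bias input first), passed through g()."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(z)

# Hypothetical values for illustration only; x[0] is the bias input, assumed to be 1.
x = [1.0, 0.5, -1.0, 2.0]          # x0, x1, x2, x3
theta_1 = [0.1, 0.4, -0.3, 0.2]    # weights into node a1: Θ10, Θ11, Θ12, Θ13

a1 = node_output(theta_1, x)       # this value would feed the next layer
print(a1)
```

With all weights set to zero the node outputs exactly 0.5, the midpoint of the logistic curve, whatever the inputs are.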

The function g() in the formula is called the activation function, and it is the method by which the inputs are processed. In our NN we chose the logistic function as the activation function. The logistic function is a sigmoid curve represented by the equation:

$$g(z) = \frac{1}{1 + e^{-z}}$$

The curve looks like this:

The logistic function limits the output from the neurons to values between 0 and 1. This is not the best biological representation of a neuron, but this function has some nice mathematical properties and it works well for object recognition. One of the attractive mathematical properties of the sigmoid function is the ease of calculating its derivative, which we will exploit in the learning algorithm implementation. The bias node, $x_0$, is not always used in NNs; we chose to use it to give more flexibility to the NN. Here is a graph that shows what happens when a bias unit is applied to the sigmoid curve.

By adding a negative or positive bias, you can shift the sigmoid curve to the right or left.
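To make the shift concrete, here is a small Python sketch of a neuron with a single input plus a bias. The function names and the weight and bias values are made up for illustration, and the bias input is assumed to be 1.

```python
import math

def sigmoid(z):
    """Logistic function, output always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, weight, bias):
    """One-input neuron: the bias weight is multiplied by a bias input of 1."""
    return sigmoid(weight * x + bias * 1.0)

def midpoint(weight, bias):
    """The x at which the output crosses 0.5, i.e. where weight*x + bias == 0."""
    return -bias / weight

# Same weight, three different biases: only the position of the curve changes.
for bias in (-2.0, 0.0, 2.0):
    print(bias, midpoint(1.0, bias), round(neuron(0.0, 1.0, bias), 3))
```

A negative bias moves the 0.5 crossing to a larger x (the curve shifts right), and a positive bias moves it to a smaller x (the curve shifts left), while the shape of the curve stays the same.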

To be effective, a learning algorithm needs to be able to learn from its mistakes. The algorithm gets feedback by comparing the expected result with the actual result, and it needs to adjust its 'thinking' to correct any discrepancy between the two. If we show an image of the number 5 to the algorithm and it thinks that it is a different number, e.g. 2, it needs to be able to dynamically adjust all the weights in the network and correct its thinking to come up with the correct result. This process of self-correction is implemented with a backpropagation algorithm. The error in the result is propagated backward through the network, and an error is calculated at each node. To have an accurate NN we need to minimize that error at each node. The error function for our NN is:
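A form that is consistent with the terms described in the list that follows is the regularized cross-entropy cost commonly used for networks like this one. Treat it as an assumed reconstruction rather than the exact formula from the post: the regularization strength λ, the layer index l, and the indices c, j and p are introduced here for illustration and are not given in the text.

$$f(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{c=1}^{k}\left[\, y_c^{(i)} \log\big(g_\Theta(x^{(i)})_c\big) + \big(1 - y_c^{(i)}\big)\log\big(1 - g_\Theta(x^{(i)})_c\big) \right] + \frac{\lambda}{2m}\sum_{l}\sum_{j}\sum_{p \ge 1}\big(\Theta_{jp}^{(l)}\big)^2$$

Here $y_c^{(i)}$ is 1 when training example i belongs to output c and 0 otherwise, and the last sum runs over every weight in every layer except, by the usual convention, the weights attached to the bias nodes.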

- The second term in the equation is the regularization term, and it affects the variance/bias of the NN. I will defer analyzing that term to the variance/bias blog.

- m is the number of training examples, that is, all the different images of hand-written numbers.

- k is the number of different outputs. In our case k = 10, representing the numbers 0 to 9, with 0 mapped to 10.

- y is the result we are looking for.

- $g_\Theta(x)$ is the result we get from the network.

Our goal in training the learning algorithm is to minimize f(Θ), that is, to find the lowest error. To visualize the process, here is an example 3D surface plot for a neuron with only 2 connections.

The minimum error is in one of the valleys in the plot. To find the minimum error you need to traverse the plot and at every step check if you are at the minimum, sort of like going down a mountain. The way you do this mathematically is to take the derivative of the error function to calculate a slope, and then take a small step down the slope. Iterating through this process of going down the slope is called gradient descent. There are a few parameters that you can adjust to help gradient descent converge to a global minimum. One of them is the size of the step that you take down the slope. Smaller steps will lengthen the process of finding the global minimum, but if the steps are too large you might overshoot the minimum.

We need to calculate the partial derivatives of f(Θ) if we want to use gradient descent. In the backpropagation algorithm, as we go backward through the network and calculate the error for each node, we also calculate the partial derivative for each node. At the end we add up all the results and get the partial derivatives for the network. Choosing the sigmoid function made it easier to calculate the derivative during backpropagation. Here is the derivative of the sigmoid function:

$$g'(z) = g(z)\,(1 - g(z))$$
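Here is a minimal Python sketch of gradient descent for a single sigmoid neuron with two weights, the same situation as the surface plot. It is not our training code: the four labeled examples are invented, the step size and iteration count are arbitrary, and for brevity it uses a plain squared error instead of the error function f(Θ) of the full network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """g'(z) = g(z) * (1 - g(z)), computed from the sigmoid itself."""
    g = sigmoid(z)
    return g * (1.0 - g)

# Toy labeled data, invented for illustration: (x1, x2) -> y
DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 0.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]

def error(theta):
    """Mean squared error of a two-weight neuron (an assumed, simplified error)."""
    total = 0.0
    for (x1, x2), y in DATA:
        a = sigmoid(theta[0] * x1 + theta[1] * x2)
        total += 0.5 * (a - y) ** 2
    return total / len(DATA)

def gradient(theta):
    """Partial derivative of error() with respect to each weight (chain rule)."""
    grad = [0.0, 0.0]
    for (x1, x2), y in DATA:
        z = theta[0] * x1 + theta[1] * x2
        delta = (sigmoid(z) - y) * sigmoid_derivative(z)
        grad[0] += delta * x1
        grad[1] += delta * x2
    return [g / len(DATA) for g in grad]

def gradient_descent(theta, step_size, iterations):
    """Repeatedly take a small step down the slope, recording the error."""
    history = [error(theta)]
    for _ in range(iterations):
        grad = gradient(theta)
        theta = [t - step_size * g for t, g in zip(theta, grad)]
        history.append(error(theta))
    return theta, history

theta, history = gradient_descent([0.0, 0.0], step_size=0.5, iterations=200)
print(history[0], history[-1], theta)
```

The `step_size` argument is the parameter discussed above: lowering it makes each step safer but means more iterations are needed to reach the same error.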

Using the backpropagation algorithm with gradient descent, we ran the learning process multiple times with different parameters. To get some feedback on which parameters were improving the accuracy, we generated graphs that showed the system accuracy with respect to the different parameters.

After we trained the system with the initial data set, we tested the system on unseen data, that is, data that was not used to train the algorithm. The accuracy on the unseen data was very low compared to the accuracy on the data that was used to train the system. After further testing it was clear that the system was not generalizing well. This means that it learned only the numbers that it had seen, but it failed to understand numbers it had not seen before. Its understanding of numbers did not extend to similar-looking numbers, only to very specific images of the numbers. A small change in the structure of a number, and the system was unable to recognize it. This is a common issue with machine learning, and the concepts of variance and bias help us modify the algorithm and make it more general. As I’ve said before, we will come back to this in another blog.

After we finished training the learning algorithm, we saved the weights into a csv file and decided to use them in an Android application. The weights are like our memories. Together with the NN architecture, they define the learning system. We implemented the same architecture and used the same weights in an Android app to test the system. We created a simple app that reads images of hand-written numbers and passes data from the images to an NN for processing. Here is a video of the application working:
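The idea of carrying the trained weights over to another program can be sketched in Python as follows. This is not the Android code. The tiny network and its weight values are made up, the helper names are hypothetical, an in-memory buffer stands in for the csv file, and the layout of one csv row per receiving neuron is an assumption about how the weights could be stored.

```python
import csv
import io
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def save_weights(matrix, stream):
    """Write one csv row of weights per receiving neuron."""
    csv.writer(stream).writerows(matrix)

def load_weights(stream):
    """Read the rows back and convert every field to a float."""
    return [[float(value) for value in row] for row in csv.reader(stream) if row]

def layer(theta, activations):
    """Prepend the bias input (assumed to be 1.0) and compute each neuron's output."""
    inputs = [1.0] + list(activations)
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in theta]

def predict(theta1, theta2, pixels):
    """Feed the pixels forward and return the index of the strongest output."""
    output = layer(theta2, layer(theta1, pixels))
    return max(range(len(output)), key=output.__getitem__)

# Tiny made-up network (2 inputs, 2 hidden neurons, 2 outputs) standing in
# for the real 1601-101-10 one; the weight values are arbitrary.
theta1 = [[0.0, 1.0, -1.0], [0.0, -1.0, 1.0]]
theta2 = [[0.0, 2.0, -2.0], [0.0, -2.0, 2.0]]

buffer = io.StringIO()          # stands in for a weights .csv file on disk
save_weights(theta1, buffer)
buffer.seek(0)
restored = load_weights(buffer)

print(restored == theta1, predict(restored, theta2, [1.0, 0.0]))
```

Because the weights and the architecture fully define the network, any program that loads the same numbers and performs the same forward pass gives the same answer. With the labeling described earlier, the tenth output stands for the digit 0, so the returned index would still need to be mapped to a digit.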

Since this is already a lengthy blog, I will post and comment on the Android code in another blog and talk about variance and bias in yet another one.