Perceptrons¶
these notes are, in part, based on Eugene Charniak’s book Introduction to Deep Learning; it’s a good book for introducing the basic ideas and intuitions of deep learning, and so you should buy a copy!
computer vision is one of the most popular problems in machine learning
one well-known machine learning test is recognizing hand-written digits
- in the popular MNIST hand-written digit data set, every image is a 28x28 grid of pixels, where each pixel is a number representing a shade of gray ranging from 0 (pure white) to 255 (pure black)
this is just an introduction to perceptrons, and so we will keep things simple and, for concreteness, consider just the problem of recognizing the hand-written digit 0
we want to write a program whose input is a small grayscale image, and whose output is either “yes” (it is a 0), or “no” (it’s not a 0)
so the input to our program will be a 28x28 grid of numbers ranging from 0 to 255
- note that 28*28 = 784, so, even more abstractly, we can think of every input image as a vector of 784 numbers
- since each of these 784 numbers ranges from 0 to 255, they can each be represented by a single 8-bit byte, meaning each image takes up 784 bytes of memory
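as a quick sanity check, here is a minimal NumPy sketch of this representation (the random array is just a stand-in for a real MNIST image):

```python
import numpy as np

# stand-in for a real MNIST image: a 28x28 grid of bytes, each 0-255
image = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)

# flatten the 28x28 grid into a vector of 784 numbers
x = image.reshape(784)

print(x.shape)   # (784,)
print(x.nbytes)  # 784 -- one 8-bit byte per pixel
```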
the goal is to write a program to distinguish 0s from not 0s
one way to do this might be to look at some examples, and hand-craft some rules that recognize 0s
but this is not very effective in practice: it is too hard for humans to come up with all the little details necessary for such rules to work
- plus, if you start to scale up to more complex problems, like recognizing letters, images, faces, people on the street, etc., it becomes enormously difficult to hand-craft useful rules
so instead, the rules for recognizing 0s can be learned from data
the basic idea is to write a program that learns by examining a training set of correctly labelled images
- that is, we will show the program some images that are 0s, and some that are not, and then have the program infer rules for distinguishing the two kinds of pictures
- this is an example of supervised learning: learning by looking at correctly labelled examples
Perceptrons¶
Perceptrons are a simple model of human neurons that, while not effective for learning in general, are easy to understand and so make a good first step in understanding machine learning
the image below is a perceptron with 5 inputs and 1 output:
in general, a perceptron can have any number of inputs, but only 1 output
for the 0 recognition problem, there would be 784 inputs, one for each pixel of the image, and the output would be a value corresponding to “yes, it’s a 0” or “no, it’s not a 0”
on each input edge of the perceptron is a weight, and learning these weights is the goal of perceptron learning
we will use the vector notation \(w = [w_1, w_2, \ldots, w_l]\) to represent the edge weights
there is also usually one special weight, called \(b\), for bias
together, \(w\) and \(b\) are the parameters for the perceptron
\(\Phi\) is used to denote these parameters, and \(\phi_i \in \Phi\) is the \(i\)-th parameter; for a perceptron, \(\Phi = \{w_1, \ldots, w_l, b\}\)
the perceptron computes the following function:
\(f(x) = \begin{cases} 1 & \text{if}\ b + \sum_{i=1}^{l} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}\)
in words, the perceptron calculates the weighted sum of the inputs, plus the bias \(b\), and returns 1 if that sum is bigger than 0, and 0 otherwise
the inputs and weights are vectors, and so the perceptron’s sum can be nicely represented using the dot product of the vector \(x=[x_1, \ldots, x_l]\) of inputs and the vector \(w=[w_1, \ldots, w_l]\) of weights
so the perceptron function can be written more succinctly as: \(f(x) = 1\) if \(w \cdot x + b > 0\), and \(f(x) = 0\) otherwise
for convenience, the bias value \(b\) is often talked about as if it were one of the weights, \(w_0\), and the input connected to it always has the value 1
- the reason for doing this is then all the numbers fit into \(w\), which simplifies the math a little
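a minimal sketch of this function in Python (the function names here are my own, not standard):

```python
import numpy as np

def perceptron(x, w, b):
    # weighted sum of the inputs plus the bias: output 1 if positive, else 0
    return 1 if np.dot(w, x) + b > 0 else 0

def perceptron_w0(x, w):
    # equivalent version: bias folded in as w[0], whose input is always 1
    x = np.concatenate(([1.0], x))
    return 1 if np.dot(w, x) > 0 else 0
```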
superscripts are typically used to denote different inputs, i.e. \(x^{k} = [x_{1}^{k}, x_{2}^{k}, \ldots, x_{l}^{k}]\) is the image for example \(k\), and \(a^k\) is the correct answer
- so a labelled example is \((x^k, a^k)\)
we assume that we have a large set of correctly labelled images, and we divide this set into training examples (that will be used for learning) and test examples (that will be used for judging how well the learning went)
- importantly, we must be careful not to use any of the test examples in training, since that would be cheating!
- many people also use a third set of data for development purposes, to tweak the parameters of their algorithm without peeking at the test data
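a minimal sketch of such a three-way split (the sizes are whatever the caller passes in; nothing here is prescribed by these notes):

```python
import random

def split_examples(examples, n_train, n_dev):
    # shuffle once, then carve off disjoint training, development, and
    # test sets; the test set is untouched until the final evaluation
    examples = list(examples)
    random.shuffle(examples)
    train = examples[:n_train]
    dev = examples[n_train:n_train + n_dev]
    test = examples[n_train + n_dev:]
    return train, dev, test
```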
Perceptron Learning¶
the goal of perceptron learning is to find values for \(w=[w_1, \ldots, w_l]\) that correctly classify all the training examples, and also do a good job of correctly classifying images the perceptron has never seen before
it turns out that if it is possible for a perceptron to learn a function, then there is a simple algorithm (that we will see in a moment) that is guaranteed to learn it
- unfortunately, perceptrons can learn only a very small set of functions, i.e. functions that are linearly separable, and so they are not very useful in practice
here is the perceptron learning algorithm:
for each training vector \(x^k\), normalize it by dividing each \(x_i^k\) by the sum of all the \(x_i^k\) in \(x^k\) (thus all the values are between 0 and 1)
set \(b\) and all the \(w_i\) values in vector \(w\) to 0
for \(N\) iterations, or until the weights don’t change, do the following:
- a. randomize the order of the training examples
- b. for each training example \(x^k\) with answer \(a^k\):
- b.1. if \(a^k - f(x^k) = 0\), continue
- b.2. otherwise, for all weights \(w_i\), set \(w_i = w_i + \Delta w_i\) where \(\Delta w_i = (a^k - f(x^k))x_i^k\)
line b.1 says that if the perceptron returns the correct value for example \(k\), then nothing is done, i.e. the weights are unchanged
line b.2 says that to weight \(w_i\) we add \((a^k - f(x^k))x_i^k\), i.e. the difference between the correct answer and the perceptron’s output, multiplied by \(x_i^k\)
example. suppose we run the example \((x^k, a^k=1)\) (a positive example) through the perceptron learning algorithm, and let’s assume the perceptron classifies it incorrectly, i.e. \(f(x^k) = 0\)
then line b.2 calculates \(\Delta w_i = (a^k - f(x^k))x_i^k = (1 - 0)x_i^k = x_i^k\), i.e. \(w_i\) will be increased
intuitively, this makes sense because the perceptron’s answer was too low, and adding \(x_i^k\) will increase it (all the \(x\) values are non-negative, because they are grayscale color values)
example. now suppose we run the example \((x^k, a^k=0)\), and the perceptron classifies it incorrectly, i.e. \(f(x^k) = 1\)
then line b.2 calculates \(\Delta w_i = (a^k - f(x^k))x_i^k = (0 - 1)x_i^k = -x_i^k\), i.e. \(w_i\) will be decreased
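putting the whole algorithm together, here is a minimal sketch in Python/NumPy; the function name and structure are my own choices, and it assumes each image has at least one non-zero pixel so the normalization is well defined:

```python
import random
import numpy as np

def train_perceptron(examples, n_iterations):
    # examples: list of (x, a) pairs, where x is a vector of pixel values
    # (a NumPy array) and a is the correct answer, 0 or 1
    # normalize each x by the sum of its entries, so all values lie in [0, 1]
    examples = [(x / x.sum(), a) for (x, a) in examples]
    w = np.zeros(len(examples[0][0]))   # all weights start at 0
    b = 0.0                             # as does the bias
    for _ in range(n_iterations):
        changed = False
        random.shuffle(examples)        # step a: randomize the order
        for x, a in examples:           # step b
            f = 1 if np.dot(w, x) + b > 0 else 0
            if a - f == 0:              # step b.1: answer already correct
                continue
            w = w + (a - f) * x         # step b.2: each w_i += (a - f) * x_i
            b = b + (a - f)             # bias treated as w_0 with input 1
            changed = True
        if not changed:                 # a full pass with no changes: done
            break
    return w, b
```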
the perceptron learning algorithm is guaranteed to find weights that work for all the training examples if they are from a function that the perceptron can represent
- a perceptron can only learn linearly separable functions, i.e. functions whose positive and negative examples can be separated by a hyperplane (i.e. the equivalent of a line in higher dimensions)
- in practice, very few functions are linearly separable
but if the training examples are from a function that isn’t linearly separable, then the algorithm loops forever, and so in that case it’s necessary to stop it after some number of iterations
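XOR on two inputs is the classic example: no line separates its positive examples \((0,1)\) and \((1,0)\) from its negative examples \((0,0)\) and \((1,1)\), so the update rule never settles; here is a self-contained sketch of this (normalization is skipped, since these toy inputs aren’t images):

```python
import numpy as np

# XOR: the answer is 1 exactly when the two inputs differ
examples = [(np.array([0.0, 0.0]), 0), (np.array([0.0, 1.0]), 1),
            (np.array([1.0, 0.0]), 1), (np.array([1.0, 1.0]), 0)]

w, b = np.zeros(2), 0.0
for _ in range(1000):                 # cap the iterations: this never converges
    for x, a in examples:
        f = 1 if np.dot(w, x) + b > 0 else 0
        w, b = w + (a - f) * x, b + (a - f)

# however long we train, at least one example is still misclassified
errors = sum(a != (1 if np.dot(w, x) + b > 0 else 0) for x, a in examples)
print(errors)                         # always >= 1
```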
a single perceptron gives one yes/no answer, and so if we want to classify something with more than two possibilities (e.g. classifying hand-written digits from 0 to 9), then we can use one perceptron per class
e.g. to classify the 10 hand-written digits, we could use 10 perceptrons, one for each digit, and the input to each perceptron would be the 784 pixels of the input image
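one common way to combine the ten answers (not spelled out in these notes) is to skip the 0/1 thresholding and pick the digit whose perceptron produces the largest weighted sum, e.g.:

```python
import numpy as np

def classify_digit(x, ws, bs):
    # ws[d] and bs[d] are the trained weights and bias for digit d's perceptron
    scores = [np.dot(ws[d], x) + bs[d] for d in range(10)]
    return int(np.argmax(scores))  # the digit with the highest score wins
```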
Beyond Perceptrons¶
it was known by the end of the 1960s that perceptrons were not an effective classifier for most functions
in the early 1980s, back propagation was discovered to be a much more effective learning method
but it was still quite slow and not always that effective, i.e. neural nets often did not perform as well as other learning techniques
by around 2010, things changed
it was discovered (through a lot of hard work by many researchers!) that multi-layer neural networks could be trained to effectively learn complex functions
- a layer in a feed-forward neural network is a “column” of neurons (see the sketch after this list)
- e.g. the input layer is connected directly to the input, and then 0 or more hidden layers are connected after the input layer
- the hard part is figuring out how to train the hidden layers effectively
- another important improvement was the use of high-performance hardware to run neural net algorithms quickly
- e.g. neural nets need to calculate lots of dot products, and so hardware such as graphics processing units (GPUs) is a good choice for running neural nets, because it is optimized for such operations
- libraries like TensorFlow are designed to automatically take advantage of GPU hardware if your computer has it
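to make the idea of layers concrete, here is a minimal sketch of a forward pass through a feed-forward net with one hidden layer (the layer sizes and the ReLU nonlinearity are illustrative choices, not from these notes; note the stack of dot products, which is exactly what GPUs are good at):

```python
import numpy as np

def feed_forward(x, layers):
    # layers: list of (W, b) pairs; each layer is a batch of dot products
    # followed by a nonlinearity (here ReLU)
    for W, b in layers:
        x = np.maximum(0.0, W @ x + b)
    return x

# e.g. 784 pixel inputs -> hidden layer of 128 neurons -> 10 outputs
rng = np.random.default_rng(0)
layers = [(0.01 * rng.standard_normal((128, 784)), np.zeros(128)),
          (0.01 * rng.standard_normal((10, 128)), np.zeros(10))]
print(feed_forward(rng.random(784), layers).shape)  # (10,)
```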
there are other kinds of neural nets beyond the feed-forward ones we’ve seen here, but delving into those is beyond the scope of this course