Look at the little group of numbers on the right. They were handwritten and aren't exactly neat and tidy. Take the first four on the third line. Our brain says 1, 7, 4, 2. We might put them together as 1742 and some of you might think of a year - perhaps the year Handel's Messiah was first performed. And we do it with a brain composed of nearly 90 billion neurons, each with up to ten thousand connections to other neurons. All of this uses just 20 watts of power and weighs about a kilogram and a half.

The brain's neurons are simple computational objects. There are input 'wires' called dendrites that connect to the cell body and outputs called axons. Signals arrive on the dendrites and the neuron sums them. If they exceed some critical value the neuron "fires", sending a signal out through the axons to other neurons. There's a lot of detail to the nature of the wiring and how it comes to be, but for now think of the neurons as objects that sum inputs and fire if a critical value is exceeded... more abstractly they're just mathematical functions.
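As a crude sketch, that "sum the inputs and fire if a critical value is exceeded" behavior can be written as a tiny function (the threshold value and inputs here are made up for illustration):

```python
# A toy version of the neuron described above: sum the signals arriving
# on the "dendrites" and fire if the total exceeds a critical value.
# The threshold of 1.0 is an arbitrary illustrative choice.
def neuron_fires(inputs, threshold=1.0):
    return sum(inputs) > threshold

print(neuron_fires([0.3, 0.4, 0.5]))  # 1.2 exceeds 1.0 -> True
print(neuron_fires([0.1, 0.2]))       # 0.3 does not    -> False
```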

The reason for starting off with some numbers is that one of the basic programs - the 'hello world' of machine learning, if you will - is to build a number recognizer. A very large set of handwritten and printed numbers was scanned by a 28 by 28 pixel sensor giving a 784 pixel image. Each of these pixels has a value depending on the brightness of the part of the number it corresponds to. For the sake of argument say it ranges from 0.00 for black to 1.00 for bright white. A very crisply written 1 would be a line of dark values in a sea of mostly white. The edges would have intermediate grey values. Less crisply written numbers have a lot of grey regions.

To make our neural network (call it a NN), each one of these pixel values is put into a corresponding neuron. The number in the neuron is called its activation. You can think of the neuron as doing something when its activation is a high number.

These 784 neurons are going to be the first layer of our NN, one for the brightness of each pixel of the image. On the far side is a small layer of 10 neurons representing the digits 0 through 9. Each of these contains the NN's best guess for how much the scanned number is that digit. For example you might have 0.02, 0.14, 0.38, 0.02, 0.98, 0.22, 0.13, 0.22, 0.33, 0.07 .. the guess that it is the digit 0 is 0.02, that it is 1 is 0.14 and so on. In this case 0.98 stands out, so the NN is jumping up and down saying "it's a 4".
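Reading off the NN's answer from that output layer is just a matter of finding the neuron with the largest activation. A minimal sketch using the example activations above:

```python
# The output layer from the example: ten activations, one per digit 0-9.
guesses = [0.02, 0.14, 0.38, 0.02, 0.98, 0.22, 0.13, 0.22, 0.33, 0.07]

# The network's answer is the digit whose neuron has the highest activation.
best_digit = guesses.index(max(guesses))
print(best_digit)  # -> 4 (the 0.98 that stands out)
```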

In between are other layers .. often called 'hidden layers.' Rather than presenting the math that describes the network and waving our hands saying "it's all in the training", let's go in more deeply and see what's going on. The diagram shows a very simple NN intended to recognize numbers with 784 input neurons on the left (it only shows a few... use your imagination), a single hidden layer of 15 neurons and an output layer with 10 neurons.

Several years ago I wrote my first machine learning program to recognize numbers from this data set. I used two hidden layers with 20 neurons each. Their number and sizes are not hard and fast choices - there's a lot of room for experimentation and frankly a lot of this is more art than engineering.

Consider the numbers from the pixels in the first layer. These activations will determine the activations in the first hidden layer, which in turn determine the activations in the second hidden layer, which determine those in the output layer. A pattern in the first determines a pattern in the second which determines a pattern in the third, and the final pattern tells us the guesses for each number.

I like to think about the hidden layers doing something rather than just saying they're a black box. It may be something very foreign to us, but in this case it is possible that one might, say, recognize components like edges and the next may recognize larger components like lines or roundish structures (a 9 is a line and a roundish structure, a 7 has two lines...). It probably isn't doing this, but it's reasonable to think of a NN breaking down problems into layers of abstraction. Trying to understand what's going on in these hidden layers is very difficult in large problems - perhaps impossible - and that becomes an important issue when machine learning is used for purposes that impact us.^{1}

At this point the post will diverge a bit. I'll go into some, but not all, of the detail that some folks may find useful. I'll highlight it in blue. Feel free to skip it if you're only interested in a high level view. Up to now we've established that a neural network is just a function that takes input on one side and spits out an answer on the other. In this case the input is the brightness values of a 28 by 28 pixel scan of a number and the output is the NN's best guess, for each digit from 0 to 9, of how likely the input was that number.

Consider the simple, but artificial, example where the second layer is trying to look for some kind of feature of the number - say certain types of line segments. Each of the 784 neurons is connected to each of the 20 neurons in the second layer. Each value in the leftmost layer is a pixel value. As an exercise to make things clear, take out a sheet of paper and make a vertical column of circles labeling them a_{1}, a_{2}, a_{3}, ... up to a_{784} (only draw a few of them ... ellipses are your friends). On the right draw another column with a couple of circles. Connect each circle on the left with each circle on the right. We multiply each of these activations by a weight - just a number - and find the weighted sum of all of the pixels. Labeling the n^{th} activation as a_{n} and the n^{th} weight as w_{n} we get: w_{1}a_{1} + w_{2}a_{2} + w_{3}a_{3} + ... + w_{784}a_{784}. In the example of trying to find a little line segment, imagine making all of the weights 0 except for those in the region right around the little segment. Then taking the weighted sum just focuses on the values of the pixels we care about. If we really wanted to force the issue we could subtract a number called the bias from the sum to require that the weighted sum be large before the neuron reacts. The bias suppresses weaker bits of signal in the vicinity of what we're interested in. The weighted sum going into the neuron on the second layer is now (w_{1}a_{1} + w_{2}a_{2} + w_{3}a_{3} + ... + w_{784}a_{784} - b), where b is the bias.

We're getting closer. This new activation for the neuron in the second layer can have quite a range. If the weights are between -1 and 1, it can be between -784 and +784 before adding the bias. It is common to scrunch everything into a range between 0 and 1 using a sigmoid function, call it s(x).^{2} Now the scrunched weighted sum is s(w_{1}a_{1} + w_{2}a_{2} + w_{3}a_{3} + ... + w_{784}a_{784} - b) for each of the 20 neurons in the second layer. This weighted connection into the neuron in the second column is the crude equivalent of a synapse in biology.
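A sketch of that scrunched weighted sum, using the logistic function from footnote 2. The weights, inputs and bias here are made-up illustrative values, and the example uses 3 inputs rather than 784:

```python
import math

def s(x):
    """Logistic sigmoid from footnote 2: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def activation(weights, inputs, bias):
    # s(w1*a1 + w2*a2 + ... + wn*an - b), the scrunched weighted sum above.
    z = sum(w * a for w, a in zip(weights, inputs)) - bias
    return s(z)

# Tiny illustrative example (3 inputs instead of 784):
print(activation([0.5, -0.2, 0.8], [1.0, 0.0, 1.0], 0.3))  # s(1.0), about 0.73
```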

You'll note the naming convention I've used isn't good enough. We're just looking at what goes into a single neuron of the second layer. In this example there are 19 more, each with a connection to each of the 784 neurons of the first layer and each connection with its own weight. And there's another bias for each. Do this and you have the first two layers. Next you consider the connections between the second and third layers (the two hidden layers in this example). After you finish that you have the connections between the third and final layer. And this is just a simple example. Fortunately computers are really good at this kind of math and there's a natural way to easily organize this. If you get into this you'll need to be on good speaking terms with linear algebra.^{3}
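A toy version of footnote 3's organization - one layer as a weight matrix applied to a column of activations, in plain Python rather than a linear algebra library. All of the numbers here are made up, and tiny layers stand in for the 784 and 20 neuron ones:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(W, a, b):
    # Row i of W holds the weights feeding neuron i of the next layer:
    # next_a[i] = s( W[i][0]*a[0] + W[i][1]*a[1] + ... - b[i] )
    return [sigmoid(sum(w * x for w, x in zip(row, a)) - bi)
            for row, bi in zip(W, b)]

# 4 input pixels feeding 2 hidden neurons (stand-ins for 784 and 20):
W = [[0.1, 0.9, 0.9, 0.1],
     [0.8, 0.0, 0.0, 0.8]]
a = [0.0, 1.0, 1.0, 0.0]
b = [0.5, 0.5]
print(layer(W, a, b))
```

Stacking calls to `layer`, one per set of connections, gives the whole network: the output of one call becomes the input of the next.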

Before getting into the "magic" of how these weights are found, take a look at the complexity of this example. There are (784 * 20) + (20 * 20) + (20 * 10) weights and 20 + 20 + 10 biases. That's 16,330 knobs to adjust correctly - yikes! Imagine what it's like for problems with columns millions of elements long. This approach, and there are many variations, didn't take off until computers became powerful enough and we had interesting data - big data - to play with. Although there are a lot of calculations, each is very simple. You don't need a big Intel processor with its complex instruction set. GPUs - graphics processing units - are arrays of thousands of tiny, fairly slow, simple processors. Each is something like ten or twenty times slower than a modern Intel CPU, but all of them can run in parallel .. you end up being faster by a factor of a thousand or so. Complex machine learning problems have moved Nvidia beyond computer graphics. It extends to smartphones - Apple's machine learning core on its latest processors can make this happen in your phone rather than having to rely on the cloud.
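Counting the knobs is easy to check:

```python
# Knob count for the two-hidden-layer network described above:
# 784 inputs -> 20 -> 20 -> 10 outputs.
layers = [784, 20, 20, 10]
weights = sum(m * n for m, n in zip(layers, layers[1:]))  # 784*20 + 20*20 + 20*10
biases = sum(layers[1:])                                  # 20 + 20 + 10
print(weights, biases, weights + biases)  # -> 16280 50 16330
```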

Onward to machine learning because we're not daft enough to adjust this by hand.

What we want to do is come up with a recipe - an algorithm - that adjusts those 16,330 knobs so we're likely to get the right answer when we give the machine a scanned number. For that we'll use some big data .. the enormous set of scanned and labeled (what the scan really is: a 7, for example) 28 by 28 pixel numbers that's in the public domain.^{4} The data is used to train the network. You start out running some training data, check the results, make some adjustments to the weights and biases, and keep repeating the process until the result is good enough. Then you test your NN by giving it another set of labeled data it hasn't seen so you can check it's still doing a good job. The hope is the layered structure generalizes what the recognizer does beyond the training data. This is critical, and whoever implements a NN needs to be aware of the domain(s) where it's valid.

So each neuron is connected to all of the neurons in the previous layer and its activation is the scrunched weighted sum of all of those neurons' activations. You might think of the weights as the strength of each connection and the bias as an indication of whether that neuron tends to be active or inactive. To start off you set all of the weights and biases completely randomly. This will give terrible results on the first set of training data.

To make it learn you define a cost function. We know what the number should be: 4, for example. Now look at the difference between each final digit's activation and what the machine should have produced, and consider the square of that difference. If the original number was 4 we expect a perfect output list to behave like this: 0 has the value 0, 1 is 0, 2 is 0, 3 is 0, 4 is 1, 5 is 0, ...

The first column is the digit in question, the second is the expected output of a perfect NN, the third is what the network actually computed, and the fourth is the square of the difference between the computed and expected values.

| digit | expected | computed | squared difference |
|-------|----------|----------|--------------------|
| 0 | 0 | 0.3 | (0.3 - 0)^{2} = 0.09 |
| 1 | 0 | 0.2 | (0.2 - 0)^{2} = 0.04 |
| 2 | 0 | 0.1 | (0.1 - 0)^{2} = 0.01 |
| 3 | 0 | 0.8 | (0.8 - 0)^{2} = 0.64 |
| 4 | 1 | 0.3 | (0.3 - 1)^{2} = 0.49 |
| 5 | 0 | 0.7 | (0.7 - 0)^{2} = 0.49 |
| 6 | 0 | 0.3 | (0.3 - 0)^{2} = 0.09 |
| 7 | 0 | 0.8 | (0.8 - 0)^{2} = 0.64 |
| 8 | 0 | 0.5 | (0.5 - 0)^{2} = 0.25 |
| 9 | 0 | 0.0 | (0.0 - 0)^{2} = 0.00 |

The cost is the sum of the squares of the differences... in this case 2.74. Ideally we would get 0.00 for the cost...
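That cost calculation is a one-liner, using the expected and computed columns from the table above:

```python
# Cost for the table above: sum of squared differences between what the
# network produced and what a perfect network would produce for a "4".
expected = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
computed = [0.3, 0.2, 0.1, 0.8, 0.3, 0.7, 0.3, 0.8, 0.5, 0.0]

cost = sum((c - e) ** 2 for c, e in zip(computed, expected))
print(round(cost, 2))  # -> 2.74
```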

The cost is small when the image is accurately recognized, large otherwise. Now look at the average of all of the costs over all of the training data... that gives a measure of how good our NN is for that set of data. A single number for each set of settings.

The trick is just calculus. Our AI is much more artificial than intelligent. I'll just state that a method exists to find out how to turn the knobs and when to stop. I'll use a small bit of blue, so be ready to skip ahead if you don't know about gradients.

Imagine the case where we just have one variable. There would be some kind of curve relating that weight to the average cost. If it were very simple, and it's not likely to be, you could just find the minimum by taking the derivative and be done with it. A better tactic is to start somewhere on the curve and figure out which direction to move to lower the output by measuring the slope .. move left if the slope is positive, right if it's negative. You'll end up at a local minimum. Now imagine the function with many inputs - so you have many dimensions. Now you're looking for the direction of steepest descent .. just the negative of the gradient of the function.

So compute the gradient of the function, take a small step downhill and repeat over and over until you find a local minimum. Put all of the 16,330 weights and biases into a tall column vector. The negative gradient of the cost function with respect to that vector tells which corrections to all of those numbers give the most rapid decrease of the cost function. Minimizing it means better average performance across all of those training samples.
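The "small step downhill, repeat" loop can be sketched for a single made-up weight and cost curve (a real network does this for all of those knobs at once):

```python
# One step of gradient descent: nudge the knob opposite to the slope.
def descend(w, grad, learning_rate=0.1):
    return w - learning_rate * grad

# Toy cost C(w) = (w - 3)^2 has derivative 2*(w - 3); its minimum is at w = 3.
w = 0.0
for _ in range(100):
    w = descend(w, 2 * (w - 3))
print(round(w, 4))  # converges toward 3.0
```

The `learning_rate` controls the step size; it's an illustrative value here, and choosing it well is part of the "art" mentioned earlier.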

A useful way of thinking about the negative gradient is that each element tells us whether a weight or bias should be increased or decreased and how much it matters. So one that starts out as 0.02, -1.29, 0.33 ... would say w_{1} should increase a little and is relatively unimportant, w_{2} should decrease a lot and is important, and w_{3} should increase somewhat.

The method for efficiently computing these gradients is called backpropagation. We're just minimizing the cost function. It's important that this function is smooth so we can find local minima... that's why the activation numbers of the neurons have a smooth range rather than something more discrete like ones and zeros. Oddly enough this is one place where the brain is more digital than the neural network.

Playing with this example I was able to recognize a set of number images I hadn't trained the NN on with about 94% accuracy. There are more sophisticated approaches. It is important to stress that the innards get very messy and difficult to understand. That's an unavoidable bug, as it is very easy to fool ourselves. There's bias in the building of the neural network, bias in the training data and bias in its application. It is great for some applications, but when it's used in science (I have some experience with it in astrophysics), you use it as a last resort and try very hard to keep the models simple and understandable. The opposite of what happens in machine learning on social networks.

I think they're a fantastic way to approach certain classes of otherwise intractable problems and potentially misleading and even dangerous in other domains.

__________

^{1} Like associating aspects of our lives and the games we play with our likelihood of being a good engineer or a neo-Nazi. These biases, some on purpose and others by accident, are extremely important and worth taking up separately. Machine learning tends to be good at identifying things very similar to known sets. There be dragons and caution is required.

^{2} For example s(x) = 1/(1 + e^{-x}). There are a variety of functions of this class. This one is known as the logistic function - you pick what's appropriate for the problem at hand. In this case large negative numbers go to 0 and large positive numbers go to 1.

^{3} Organize the input activations as a column vector and the weights as a matrix where the row corresponds to the connections between one layer and a particular neuron in the next layer.

^{4} The MNIST database from LeCun, Cortes, and Burges.

__________

The first good brussels sprouts are out.

**Roasted Brussels Sprouts with Mustard**

**Ingredients **

° 1 pound brussels sprouts, halved

° 6 tbl olive oil

° kosher salt

° 1 tbl dijon mustard

° 1 tbl coarse grain mustard

° 1 tbl cider vinegar

**Technique**

° oven to 425°F

° put brussels sprouts on a baking sheet. drizzle on half of the olive oil and some salt and stir them (I use my hands). Roast on middle rack for about 35 - 45 min turning them a few times until softened and browned.

° whisk the mustards, olive oil and vinegar into a dressing and add a bit of salt to taste.

° put the baked brussels sprouts in a bowl and toss with the dressing.