Look at the little group of numbers on the right. They were handwritten and aren't exactly neat and tidy. Take the first four on the third line. Our brain says 1, 7, 4, 2. We might put them together as 1742, and some of you might think of a year - perhaps the year Handel's Messiah was first performed. And we do all of this with a brain composed of nearly 90 billion neurons, each with up to ten thousand connections to other neurons, using just 20 watts of power and weighing about a kilogram and a half.

The brain's neurons are simple computational objects. There are input 'wires' called dendrites that connect to the cell body, and an output called the axon. Signals arrive on the dendrites and the neuron sums them. If the sum exceeds some critical value the neuron "fires", sending a signal out through the axon to other neurons. There's a lot of detail to the nature of the wiring and how it comes to be, but for now think of neurons as objects that sum their inputs and fire if a critical value is exceeded... more abstractly they're just mathematical functions.
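In code the abstraction is tiny. A sketch, with made-up inputs and a made-up threshold just for illustration:

```python
def neuron(inputs, threshold):
    """Sum the incoming signals; 'fire' (return 1) if the sum
    exceeds the threshold, otherwise stay quiet (return 0)."""
    total = sum(inputs)
    return 1 if total > threshold else 0

# Three dendrite signals arriving at a neuron with threshold 1.0
print(neuron([0.4, 0.5, 0.3], 1.0))  # 1.2 > 1.0, so it fires: 1
print(neuron([0.1, 0.2, 0.3], 1.0))  # 0.6 stays below: 0
```

Real neurons are far messier, but this sum-and-threshold cartoon is all the math below needs.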

The reason for starting off with some numbers is that one of the basic programs - the 'hello world' of machine learning, if you will - is a number recognizer. A very large set of handwritten and printed numbers was scanned by a 28 by 28 pixel sensor, giving a 784 pixel image. Each pixel has a value depending on the brightness of the part of the number it covers. For the sake of argument say it ranges from 0.00 for black to 1.00 for bright white. A very crisply written 1 would be a line of dark values in a sea of mostly white, with intermediate grey values at the edges. Less crisp numbers have a lot of grey regions.

To make our neural network (call it a NN), each of these pixel values is put into a corresponding neuron. The number in the neuron is called its activation. You can think of the neuron as doing something when its activation is a high number.

These 784 neurons are going to be the first layer of our NN, representing the brightness of each pixel of the image. On the far side is a small layer of 10 neurons, one for each of the numbers 0 through 9. Each contains the NN's best guess for how much the scanned number looks like that digit. For example you might have 0.02, 0.14, 0.38, 0.02, 0.98, 0.22, 0.13, 0.22, 0.33, 0.07... the guess that it is the number 0 is 0.02, 1 is 0.14 and so on. In this case 0.98 stands out, so the NN is jumping up and down saying "it's a 4".

In between are other layers... often called 'hidden layers.' Rather than presenting the math that describes the network and waving our hands saying "it's all in the training", let's go in more deeply and see what's going on. The diagram shows a very simple NN intended to recognize numbers, with 784 input neurons on the left (it only shows a few... use your imagination), a single hidden layer of 15 neurons and an output layer of 10 neurons.

Several years ago I wrote my first machine learning program to recognize numbers from this data set. I used two hidden layers with 20 neurons each. The number and sizes of the hidden layers are not hard and fast choices - there's a lot of room for experimentation and frankly a lot of this is more art than engineering.

Consider the numbers from the pixels in the first layer. These activations determine the activations in the first hidden layer, which in turn determine the activations in the second hidden layer, which determine those in the output layer. A pattern in the first layer determines a pattern in the second, which determines a pattern in the third, and the final pattern tells us the guesses for each number.

I like to think about the hidden layers doing something rather than just saying they're a black box. It may be something very foreign to us, but in this case it is possible that one might, say, recognize components like edges and the next may recognize larger components like lines or roundish structures (a 9 is a line and a roundish structure, a 7 has two lines...). It probably isn't doing exactly this, but it's reasonable to think of a NN breaking down problems into layers of abstraction. Trying to understand what's going on in these hidden layers is very difficult in large problems - perhaps impossible - and that becomes an important issue when machine learning is used for purposes that impact us.^1

At this point the post will diverge a bit. I'll go into some, but not all, of the detail that some folks may find useful. I'll highlight it in blue. Feel free to skip it if you're only interested in a high level view. Up to now we've established that a neural network is just a function that takes input on one side and spits out an answer on the other. In this case the input is the brightness values of a 28 by 28 pixel scan of a number and the output is the NN's best guess, for each digit from 0 to 9, of how likely the input was that number.

Consider the simple, but artificial, example where the second layer is trying to look for some kind of feature of the number - say certain types of line segments. Each of the 784 neurons is connected to each of the 20 neurons in the second layer. Each value in the leftmost layer is a pixel value. As an exercise to make things clear, take out a sheet of paper and make a vertical column of circles labeling them a_1, a_2, a_3, ... up to a_784 (only draw a few of them... ellipses are your friends). On the right draw another column with a couple of circles. Connect each circle on the left with each circle on the right. We multiply each of these activations by a weight - just a number - and find the weighted sum over all of the pixels. Labeling the nth activation a_n and the nth weight w_n we get: w_1a_1 + w_2a_2 + w_3a_3 + ... + w_784a_784. In the example of trying to find a little line segment, imagine making all of the weights 0 except for those in the region right around the segment. Then the weighted sum focuses only on the values of the pixels we care about. If we really wanted to force the issue we could subtract a number called the bias from the sum, requiring the weighted sum to be large before the neuron responds. The bias suppresses other bits of signal in the vicinity of what we're interested in. The weighted sum going into the neuron in the second layer is now (w_1a_1 + w_2a_2 + w_3a_3 + ... + w_784a_784 - b), where b is the bias.
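A toy sketch of that line-segment detector. The pixel values, the chosen region (pixels 100 through 104), and the bias are all made up for illustration:

```python
# 784 pixel activations, mostly background, with a bright
# "line segment" at an arbitrarily chosen spot
pixels = [0.0] * 784
for i in range(100, 105):
    pixels[i] = 0.9          # bright pixels where the segment sits

# Weights are zero everywhere except the region we care about
weights = [0.0] * 784
for i in range(100, 105):
    weights[i] = 1.0

bias = 4.0                   # demand a large weighted sum before responding
weighted_sum = sum(w * a for w, a in zip(weights, pixels)) - bias
print(weighted_sum)          # 5 * 0.9 - 4.0, i.e. about 0.5
```

With the weights zeroed outside the region, a bright segment anywhere else in the image contributes nothing: the neuron only "sees" its patch.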

We're getting closer. This new activation for the neuron in the second layer can have quite a range. If the weights are between -1 and 1, it can be anywhere between -784 and +784 before adding the bias. It is common to scrunch everything to a range between 0 and 1 using a sigmoid function, call it s(x).^2 Now the scrunched weighted sum is s(w_1a_1 + w_2a_2 + w_3a_3 + ... + w_784a_784 - b) for each of the 20 neurons in the second layer. This weighted, biased connection into the neuron in the second column is the crude equivalent of a synapse in biology.
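The logistic function from the footnote is one such scrunching function. A sketch:

```python
import math

def s(x):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(s(0))     # 0.5 -- right in the middle
print(s(10))    # very close to 1
print(s(-10))   # very close to 0
```

However wild the raw weighted sum gets, the output is always a tame number between 0 and 1, ready to be the next layer's activation.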

You'll note the naming convention I've used isn't good enough. We're just looking at what goes into a single neuron of the second layer. In this example there are 19 more, each with a connection to each of the 784 neurons of the first layer, and each connection with its own weight. And each neuron has its own bias. Do this and you have the first two layers. Next you consider the connections between the second and third layers (the two hidden layers in this example). After you finish that you have the connections between the third and final layer. And this is just a simple example. Fortunately computers are really good at this kind of math and there's a natural way to organize it all. If you get into this you'll need to be on good speaking terms with linear algebra.^3
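Stacked into vectors and matrices (as in the footnote), the bookkeeping becomes mechanical. A sketch of one forward pass through the 784 → 20 → 20 → 10 network of this example, using NumPy and random, untrained weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
sizes = [784, 20, 20, 10]      # input, two hidden layers, output

# One weight matrix and one bias vector per pair of adjacent layers;
# row i of W holds the weights feeding neuron i of the next layer
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

def forward(a):
    """Feed a 784-element pixel vector through the whole network."""
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a - b)  # scrunched weighted sum, minus the bias
    return a

guesses = forward(rng.random(784))  # ten numbers, one guess per digit
print(guesses.shape)                # (10,)
```

With random weights the ten outputs are meaningless, of course - turning them into good guesses is the whole training problem below.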

Before getting into the "magic" of how these weights are found, take a look at the complexity of this example. There are (784 × 20) + (20 × 20) + (20 × 10) = 16,280 weights and 20 + 20 + 10 = 50 biases. That's 16,330 knobs to adjust correctly - yikes! Imagine what it's like for problems with columns millions of elements long. This approach, and there are many variations, didn't take off until computers became powerful enough and we had interesting data - big data - to play with. Although there are a lot of calculations, each is very simple. You don't need a big Intel processor with its complex instruction set. GPUs - graphics processing units - are arrays of thousands of fairly slow, simple processors. Each is something like ten or twenty times slower than a modern Intel CPU, but all of them can run in parallel... you end up being faster by a factor of a thousand or so. Complex machine learning problems have moved Nvidia well beyond computer graphics. It extends to smartphones - the machine learning core on Apple's latest processors can make this happen in your phone rather than having to rely on the cloud.
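The knob count is quick to check:

```python
sizes = [784, 20, 20, 10]

# A weight for every connection between adjacent layers...
n_weights = sum(n * m for n, m in zip(sizes[:-1], sizes[1:]))  # 16,280
# ...and a bias for every neuron past the input layer
n_biases = sum(sizes[1:])                                      # 50

print(n_weights + n_biases)  # 16330
```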

Onward to machine learning because we're not daft enough to adjust this by hand.

What we want to do is come up with a recipe - an algorithm - that adjusts those 16,330 knobs so we're likely to get the right answer when we give the machine a scanned number. For that we'll use some big data... the enormous set of scanned and labeled (the label is what the scan really is: a 7, for example) 28 by 28 pixel numbers that's in the public domain.^4 The data is used to train the network. You start out running some training data, check the results, make some adjustments to the weights and biases, and keep repeating the process until the result is good enough. Then you test your NN by giving it another set of labeled data it hasn't seen, to check if it's still doing a good job. The hope is the layered structure generalizes what the recognizer does beyond the training data. This is critical, and whoever implements a NN needs to be aware of the domain(s) where it's valid.

So each neuron is connected to all of the neurons in the previous layer, and its activation is the scrunched weighted sum of the activations of all of those neurons. You might think of the weights as the strength of each connection, and the bias as an indication of whether that neuron tends to be active or inactive. To start off you set all of the weights and biases completely randomly. That will give terrible results on the first set of training data.

To make it learn you define a cost function. We know what the number should be: 4, for example. Now look at the difference between each final digit's guess and what it should have been, and consider the square of that difference. If the original number was 4, we expect a perfect NN's output for 0 to have the value 0, 1 to be 0, 2 to be 0, 3 to be 0, 4 to be 1, 5 to be 0, ...

The first column is the digit in question, the second is the expected output of a perfect NN, the third is what our NN actually computed, and the last two columns show the square of the difference between them and its value.

0   0   0.3   (0.3 - 0)²   0.09
1   0   0.2   (0.2 - 0)²   0.04
2   0   0.1   (0.1 - 0)²   0.01
3   0   0.8   (0.8 - 0)²   0.64
4   1   0.3   (0.3 - 1)²   0.49
5   0   0.7   (0.7 - 0)²   0.49
6   0   0.3   (0.3 - 0)²   0.09
7   0   0.8   (0.8 - 0)²   0.64
8   0   0.5   (0.5 - 0)²   0.25
9   0   0.0   (0.0 - 0)²   0.00

The cost is the sum of the squares of the differences... in this case 2.74. Ideally we would get 0.00 for the cost...

The cost is small when the image is accurately recognized, large otherwise. Now look at the average of the costs over all of the training data... that gives a measure of how good our NN is for that set of data. A single number for each setting of the knobs.
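Using the numbers from the table, the cost for this one badly-guessed 4 works out like this:

```python
# Expected output for a perfect NN shown a '4', and the (made-up)
# output from the table above
expected = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
computed = [0.3, 0.2, 0.1, 0.8, 0.3, 0.7, 0.3, 0.8, 0.5, 0.0]

# Sum of squared differences over the ten digits
cost = sum((c - e) ** 2 for c, e in zip(computed, expected))
print(round(cost, 2))  # 2.74
```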

The trick is just calculus. Our AI is much more artificial than intelligent. I'll just state that a method exists to find out how to turn the knobs and when to stop. I'll use a small bit of blue, so be ready to skip if you don't know about gradients.

Imagine the case where we just have one knob. There would be some kind of curve relating that weight to the average cost. If it was very simple, and it's not likely to be, you could just find the minimum by taking the derivative and be done with it. A better tactic is to start somewhere on the curve and figure out which direction to move to lower the output by measuring the slope... move left if the slope is positive, right if it's negative. You'll end up at a local minimum. Now imagine the function with many inputs - many dimensions. Now you're looking for the direction and steepness of steepest descent... just the negative of the gradient of the function.

So compute the gradient of the function, take a small step downhill, and repeat over and over until you find a local minimum. Put all 16,330 weights and biases into a tall column vector. The negative gradient of the cost function with respect to that vector tells which corrections to all of those numbers give the most rapid decrease of the cost function. Minimizing it means better average performance across all of those training samples.
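A sketch of the one-knob version of this downhill walk. The cost curve here is made up (a simple bowl with its minimum at w = 3), and the slope is measured numerically rather than with calculus:

```python
def cost(w):
    """A made-up one-knob cost curve with its minimum at w = 3."""
    return (w - 3.0) ** 2 + 1.0

def slope(w, h=1e-6):
    """Numerical derivative: rise over run across a tiny step."""
    return (cost(w + h) - cost(w - h)) / (2 * h)

w = 10.0                 # start somewhere on the curve
for _ in range(200):
    w -= 0.1 * slope(w)  # step against the slope, i.e. downhill
print(round(w, 3))       # settles at the minimum, 3.0
```

The real thing does exactly this, except the "slope" is a 16,330-element gradient vector and each step nudges every knob at once.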

A useful way of thinking about the negative gradient is that each element tells us whether a weight or bias should be increased or decreased, and how much it matters. One that starts out 0.02, -1.29, 0.33, ... would say w_1 should increase a little and is relatively unimportant, w_2 should decrease a lot and is important, and w_3 should increase somewhat.

This setting of the weights and biases relies on an algorithm called backpropagation to compute the gradient efficiently. We're just minimizing the cost function. It's important that this function is smooth so we can find local minima... that's why the activations of the neurons have a smooth range rather than something more discrete like ones and zeros. Oddly enough, this is one place where the brain is more digital than the neural network.

Playing with this example I was able to recognize a set of number images I hadn't trained the NN on at about 94% accuracy. There are more sophisticated approaches. It is important to stress that the innards get very messy and difficult to understand. That's an unavoidable bug, as it is very easy to fool ourselves. There's bias in the building of the neural network, bias in the training data and bias in its application. It is great for some applications, but when it's used in science (I have some experience with it in astrophysics), you use it as a last resort and try very hard to keep the models simple and understandable - the opposite of what happens in machine learning on social networks.

I think neural networks are a fantastic way to approach certain classes of otherwise intractable problems - and potentially misleading, even dangerous, in other domains.

__________

^1 Like associating aspects of our life and the games we play with our likelihood of being a good engineer or a neo-Nazi. These biases, some on purpose and others by accident, are extremely important and worth taking up separately. Machine learning tends to be good at identifying things very similar to known sets. There be dragons and caution is required.

^2 For example s(x) = 1/(1 + e^(-x)). There are a variety of functions of this class - this one is known as the logistic function - and you pick what's appropriate for the problem at hand. In this case large negative numbers go to 0 and large positive numbers go to 1.

^3 Organize the input activations as a column vector and the weights as a matrix, where each row holds the weights connecting one layer to a particular neuron in the next layer.

^4 The MNIST database from LeCun, Cortes, and Burges.

__________

The first good brussels sprouts are out.

**Roasted Brussels Sprouts with Mustard**

**Ingredients **

° 1 pound brussels sprouts, halved

° 6 tbl olive oil

° kosher salt

° 1 tbl dijon mustard

° 1 tbl coarse grain mustard

° 1 tbl cider vinegar

**Technique**

° oven to 425°F

° put brussels sprouts on a baking sheet. drizzle on half of the olive oil and some salt and stir them (I use my hands). Roast on middle rack for about 35 - 45 min turning them a few times until softened and browned.

° whisk the mustards, olive oil and vinegar into a dressing and add a bit of salt to taste.

° put the baked brussels sprouts in a bowl and toss with the dressing.

## precision twinklers

Fifty years ago this month Jocelyn Bell was going through the chart recorder output from a new radio telescope she and a few others had built at Cambridge. She had worked herself up to about a hundred feet of it a day, looking for anything interesting in the squiggly line of ink. Very boring work. Lots of noise and bits of the expected.

But there was this "bit of scruff" when the telescope was pointed in a particular direction. The scruff was repeating every 1.3 seconds. Very regularly.

You always doubt your equipment. Regular noise is usually something man-made or an artifact of the apparatus. And there were those newfangled satellites beeping away. Carefully you try to eliminate the possibilities. She and her advisor eliminated most of them, but they weren't that certain.^1 Then she found another bit of scruff. That regular pulsing... now at a different rate, but still very accurate. Better than the clocks they had in the lab. She had found something - something out there.

There was a flurry of activity when they announced. Astronomers around the world dropped what they were doing. Theoretical physicists and astrophysicists conjectured. A few of the conjectures advanced to hypotheses. And in this swirl of activity she found a third... and then a fourth. And in theoryland one of the hypotheses made sense. The object was called a pulsar.

She had found the crack in the door to some of the deepest Nature yet encountered.

As of 2017 something like two thousand have been found. Shortly after the discovery it was suggested that a star could collapse so dramatically that its core was nothing but neutrons. Imagine a mass greater than the Sun's compressed down to something about a dozen kilometers in diameter. Like a skater pulling in her arms, it spins much faster than the star it started out as. The first went around every 1.3 seconds, but some spin more than a thousand times a second. Their magnetic fields are trillions of times stronger than the Earth's and, combined with the rotation, a beacon-like signal forms. It sweeps through the sky like the light from a lighthouse as the star rotates.

She and her advisor opened up an entirely new and unexpected branch of astrophysics that has led to a much deeper understanding of both astronomy and physics. Pulsars and their close relatives are hot areas of research. And they can even be useful. A GPS system is just a group of accurate clocks with transmitters orbiting the Earth. Pulsars are as accurate as atomic clocks. You could use them to build a galactic positioning system that would work anywhere in the Milky Way, not just on Earth. And you can make use of their regular beat to build another kind of gravitational wave detector, one complementary to the current interferometry technique.

Some of the techniques developed since then have trickled down into important technology we use and many hundreds of astronomers and astrophysicists have spent some time in the private sector making their contributions to the economy. Pure science is a very inexpensive mechanism for creating future value. The "problem" is you can't predict where it might lead.

Hewish went on to receive a Nobel Prize in Physics for the discovery. It is widely felt in the astronomy community that Jocelyn Bell should have shared in the prize, as her contributions were enormous. But: 1967 and female. She spoke of it ten years after the discovery:

"demarcation disputes between supervisor and student are always difficult, probably impossible to resolve. Secondly, it is the supervisor who has the final responsibility for the success or failure of the project. We hear of cases where a supervisor blames his student for a failure, but we know that it is largely the fault of the supervisor. It seems only fair to me that he should benefit from the successes, too. Thirdly, I believe it would demean Nobel Prizes if they were awarded to research students, except in very exceptional cases, and I do not believe this is one of them. Finally, I am not myself upset about it – after all, I am in good company, am I not!"

She's being far too generous. This was one of those exceptional cases. At least she has received other significant honors throughout her career.

And something sad.

I rarely use the term genius to describe anyone alive. Maryam Mirzakhani was an exception. A fearless mathematician, she was the only woman to receive the Fields Medal: math's highest recognition. She died on Saturday at age forty. She was just warming up. Here's a well-written piece about her that appeared in Quanta a few years ago.

Excuse my political comment, but Mirzakhani was female, brilliant and from an Islamic country. I doubt someone with these "liabilities" would be welcome here now...

__________

^1 She recognized that pulsars are astronomical sources where others had failed because she noticed that the pulses in her data (Figure 6.1) didn't look like other forms of interference and they reappeared exactly once per sidereal day, indicating an origin outside the Solar System. She and Hewish "decided initially not to computerize the output because until we were familiar with the behavior of our telescope and receivers we thought it better to inspect the data visually, and because a human can recognize signals of different character whereas it is difficult to program a computer to do so." Other people were using software to filter out noise and were throwing out the interesting signal in the process.

Posted at 06:05 PM in general comments, history of science, math
