Passing white light through a prism to break it into a range of colored light, Newton introduced the word spectrum into optics. The colored apparition was packed with information contained in the light - a mixture of frequencies and intensities. Eventually it was learned that studying these revealed important aspects of the matter the light had interacted with. In the mid 1800s physicists reignited the sleepy field of astronomy by attaching spectroscopes to telescopes, allowing the study of what made the starlight. Among the discoveries was an unknown yellow line in the Sun's spectrum. It was proposed as a possible new element, later called helium in honor of the Sun. 25 years later it was discovered on Earth.
The technique is powerful and commonly used in many areas of science and medicine, with thousands of commercial applications. The caveat is that you need to use the right technique and an appropriate instrument.
The heart of many types of spectrometers is simple - a well characterized and calibrated light source, something to hold a sample, a prism or diffraction grating to break up the light into a spectrum and a way to measure the intensity of the spectrum as a function of its wavelength (color, sort of). A smartphone has enough processing power to perform serious analysis and a few companies have made low cost spectrometers for teaching purposes. A good example is this $400 unit from Pasco. It lacks the precision to do serious work, but it's great fun and works with an iPhone or iPad.
Three times in the past two years people have sent descriptions and pitches of even more portable units using integrated spectrometer modules. While they might be fun, these usually do reflected light surface spectroscopy and appear to have poor resolution. Silly claims are frequently made - that you can determine what is in your medicine, do food safety and calculate the calories in a meal.
Nope - it isn't going to work like that. You can't use out of context uncalibrated data and somehow perform cloud magic with 'algorithms' ... Success during a Kickstarter funding period does not guarantee impossible magic. There will be a flood of interesting sensors and even serious instrumentation for smartphones and some of it will enable a density of information that is potentially transformative. Consider thousands of accurate pollution monitors per square kilometer in cities. China is ripe for a revolt from a data informed citizenry.
Data, even the simplest data, has context. What was the sample, how were the measurements made, what is the accuracy of the measuring tool, did the accuracy change over time, how was the data preprocessed, how was it postprocessed...?
A few years ago I came across a detailed description of a rope transmission from Bordeaux, France that moved mechanical power from a water wheel to several businesses hundreds of feet away. These contraptions often had mechanical efficiencies that still impress. I tried calculating the efficiency from the description, but my answer seemed far too low. I smugly came to the conclusion that the book must be in error.
The problem resurfaced recently when I was reading about the spread of the metric system in France. Standardization of weights and measures was of vital importance to commerce as hundreds of local systems were being used. The author noted rural Bordeaux defined a foot using a standard that was 35.7 cm long - dramatically different from the now standard definition of about 30.5 cm. I had been doing my calculations based on a bad assumption. Using this definition a six foot tall person would be a bit over five foot one...
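The arithmetic behind that last remark is easy to check. Here is a minimal sketch; the 35.7 cm figure comes from the text, while the 30.48 cm modern foot, the 12-inch subdivision of the Bordeaux foot, and the function name are my assumptions for illustration.

```python
# Assumed values: 35.7 cm is the Bordeaux foot quoted in the text;
# 30.48 cm is the modern international foot.
MODERN_FOOT_CM = 30.48
BORDEAUX_FOOT_CM = 35.7


def to_bordeaux_feet_inches(modern_feet):
    """Convert a length in modern feet to (feet, inches) in Bordeaux units.

    Assumes the Bordeaux foot was also divided into 12 inches.
    """
    length_cm = modern_feet * MODERN_FOOT_CM
    bordeaux_feet = length_cm / BORDEAUX_FOOT_CM
    whole_feet = int(bordeaux_feet)
    inches = (bordeaux_feet - whole_feet) * 12
    return whole_feet, inches


feet, inches = to_bordeaux_feet_inches(6)
print(f"{feet} ft {inches:.1f} in")  # 5 ft 1.5 in
```

The same 17% difference in the length unit would feed straight into any calculation built on distances given in feet.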
About three years ago a friend came to me with a question about a distribution of measurements. She's quite tall and was thinking about putting together a specialized shop for tall women. An important early question was to understand the potential market size. While there are many factors that go into clothing size, height is a reasonable proxy to get a rough idea. She knew there were tables showing height distributions, but it wasn't clear to her how to navigate the information. I had been involved in a large anthropometry survey in the nineties so one thing led to another...
You can make a histogram of heights - 100 people at 64 inches, 125 at 65 inches and so on. Increasing the number of people in your sample will make the distribution look smoother. With a very large population and very small measurement increments you'll get something like this:
It isn't exactly the bell-shaped curve you probably expected. The problem is that it is the composite of two curves. Men are about five inches taller than women on average and need to be considered a separate population. Plotting men and women separately gives two bell curves. Add them and the original distribution appears.
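A rough sketch of how the two curves add up. The women's mean of 64.3 inches and standard deviation of 2.5 are the survey figures quoted in this post; the men's mean follows from "about five inches taller"; the men's standard deviation of 2.8 and the 50/50 split are assumptions I made up for illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


WOMEN = (64.3, 2.5)  # mean, sd in inches (figures quoted in the post)
MEN = (69.3, 2.8)    # mean from "about five inches taller"; sd is assumed


def combined_pdf(height_in):
    """Density for the whole population, assuming equal numbers of each."""
    return 0.5 * normal_pdf(height_in, *WOMEN) + 0.5 * normal_pdf(height_in, *MEN)


for h in range(58, 80, 3):
    print(h, round(combined_pdf(h), 4))
```

With means this close relative to the widths, the sum comes out broad and flat-topped rather than as two obvious humps, which is why the combined plot looks like a slightly wrong bell.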
This type of curve is a Normal distribution - anyone with a STEM background has studied them in depth as they turn up almost everywhere. Human height turns out to approximately follow this type of distribution making it easy to calculate the distribution of potential customer heights assuming the published curve is relevant to your population. Measurements have been made in most countries and are repeated at regular intervals.1
My friend's target market was women who stand more than six feet tall. A small market, but she and her friends were underserved so it made sense to investigate. Perhaps the Internet would allow her to find enough customers to justify production runs greater than a few pieces. She chose a huge American survey and wrote a simple program that calculated how many women were predicted to be at or above a given height.
From her email:
Something is broken, Steve. I just integrate a normal distribution with a mean of 64.3 and a standard deviation of 2.5 from a height through infinity to get the fraction at and above a height. Here are my numbers with a little rounding. Any ideas?!?
72" 1 in 966
73" 1 in 3,900
74" 1 in 19,150
75" 1 in 107,000
76" 1 in 697,200
77" 1 in 5,298,900
78" 1 in 47,022,800
This is SO WRONG! The number of girls in college basketball and volleyball are instant disproofs!
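Her calculation can be reproduced in a few lines. This is a sketch of the approach described in the email, not her actual program: it uses the closed-form upper tail of the normal distribution (via the complementary error function) instead of numerical integration, and the function name is mine.

```python
import math


def fraction_at_or_above(height_in, mean=64.3, sd=2.5):
    """Upper-tail probability of a normal distribution.

    Equivalent to integrating the density from height_in to infinity.
    The mean and sd defaults are the survey values quoted in the email.
    """
    z = (height_in - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))


for h in range(72, 79):
    print(f'{h}"  1 in {1 / fraction_at_or_above(h):,.0f}')
```

The output lands close to the table above, so the arithmetic was fine. The trouble is in assuming the fitted normal curve is still valid that far out in the tail.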
To produce a clean fit it is standard practice to exclude the tails of a measured distribution and tell the reader where the fit is valid. It wasn't published in this case, at least not prominently, but the fit was only good to about 5'11". The tails on the real curve tend to be thicker, but making a large enough set of measurements to smooth them out is unrealistic. It is possible to get a good enough fit for all but a tiny fraction of the population with only 100,000 or so people.
Now I was curious. Would it be possible to find a better fit to the distribution? Over a period of a few months I found a few ways to build a more accurate distribution function. Rather than a simple normal distribution it is the sum of three separate distributions. The tricky part was calculating the errors.
I won't list all of the results, but I'm 95% confident that a 6'3" woman (my friend's height) is between 1 in 16,070 and 1 in 18,070, with 1 in 16,950 being the most likely.2 1 in 17,000 is probably a good enough starting point and better than any of the major surveys. My tall friend Colleen would be about 1 in 267,000, but the bounds are much wider: between 1 in 219,000 and 1 in 420,000. Another good friend is 6'8", and the estimates are very sketchy at her level: 1 in 1.38 million with a range from 1 in 0.76 million to 1 in 2.79 million. Counting athletes as a lower bound is probably better :-)
The final plot looks identical to the second plot as the size of the change is only 0.02% of the regular normal curve. This is a feature as the densely populated portion of the distribution function is nearly identical to the original. To get a sense of the shape of the adjusted function, imagine a world where my corrections are 2,500 times larger making their contribution equal to that of the normal female height distribution.
The new female probability curve is red and the old uncorrected female curve is blue in the first plot and the uncorrected male curve is blue in the second.3 In this different world a quarter of the female population is six feet tall or more. One in one hundred and thirty would be at least Colleen's height. This alternate world could be fodder for social commentary fiction. Tall women certainly wouldn't have problems finding clothing and female sports would probably dominate. Perhaps height would be unimportant socially.
Continuous distribution functions derived from observed data are common and, when used correctly, powerful. It is all too easy to miss a fundamental assumption and come to the wrong result while feeling good about it. Separate sanity checks, like looking at the roster of college volleyball teams, need to be normal practice. Being playful and not being easily seduced are requirements.
The clothing concept turned out to be impractical. Quality sources at reasonable cost simply don't exist for the necessary low volumes. It became clear that made to measure is necessary. Some is done for men's shirts in Asia and South Asia, but costs are currently high. In addition to height, there was too much variability in sizes to make the effort practical... at least for now. The trick is mass made-to-measure. Eventually it will be standard practice and poor fit will become nothing more than a memory. Until then some segments of the apparel market are tough sledding. Her advice for anyone dealt an unusual body: learn to sew, find a good tailor and make enough money for a few quality custom pieces. Find your own style rather than following the trends.
1 Health organizations usually take the measurements as they are a proxy for human health. Cheating a bit, I'll copy and paste from an email I sent on the subject:
We tend to pay a lot of attention to numbers without worrying about their accuracy. Height is a good example. If you measure a large number of adult males and females, their heights will distribute themselves along a bell shaped curve. Most people will be close to an average (a mean). The larger the sample the more the distribution will take the shape of a bell shaped curve called the normal distribution. Remarkably all you need to know is the average and a number that tells you something about how rapidly the curve changes shape. From this a designer will know the height distribution of potential customers and can adjust their manufacturing plans accordingly. You can do similar exercises for many measurements - shoe size for example.
The characteristic curves vary regionally. The US is very similar to England and can be taken as the same within the accuracy of most measurements. Northern Europe is taller, with Holland being the tallest. The North of Holland is particularly tall and is sometimes treated separately. Asia and South Asia are shorter than the US and South America is much shorter. Changes in mean and standard deviation are studied over time as height is a good proxy for health within a population. A tall person is not necessarily healthier than a short person, but for a child about 20% to 25% of the variation of their final height is influenced by health and nutrition rather than genetics. If the health of a nation changes, so will its average height. Holland has seen a dramatic change in the 20th century - particularly after WWII, with Northern Holland being the tallest region in the world. For adults under 35 there, mean female height is only an inch less than mean male height in the US, and mean male height approaches 6'2".
2 A quick comment - the curves are probability distribution functions. The horizontal axis is height and the vertical axis is probability, so you get a sense of roughly what it is. The exact probability for a point is meaningless. The probability of someone being exactly 73.0000001" is very nearly zero. Just integrate the function over the range you are interested in.
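To make the point-versus-range distinction concrete, here is a small sketch using the plain normal curve with the survey's mean of 64.3 and standard deviation of 2.5 (not my corrected function). Integrating the density over a range is the same as taking the difference of the cumulative distribution at the two ends; the function names are just for illustration.

```python
import math


def normal_cdf(x, mean=64.3, sd=2.5):
    """Probability of a height at or below x under the plain normal fit."""
    return 0.5 * math.erfc(-(x - mean) / (sd * math.sqrt(2)))


def prob_between(lo, hi):
    """Integral of the density from lo to hi, as a difference of CDFs."""
    return normal_cdf(hi) - normal_cdf(lo)


# A one-inch bin centered on 73 inches has a small but usable probability.
print(prob_between(72.5, 73.5))

# A sliver a ten-millionth of an inch wide has essentially none.
print(prob_between(73.0, 73.0000001))
```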
Measurement errors in the height studies and insufficient statistics are the major components that make up the errors. A trained person using a simple stadiometer - those vertical rulers with a sliding horizontal arm - can repeatedly measure height to about a half centimeter, but some surveys are self-reported.
3 Here I have plotted real probabilities for each inch of height and have connected them with a smooth curve - similar to the density function, but a bit different. I'm just doing it to show there are different ways to interpret and display the data, and that drawing conclusions requires some understanding.
Here is the probability plot of just the correction in black scaled up to represent a full population next to a normal female plot in red. The shape isn't simple and has a much larger mean and width than the normal curve for women's height. Scaled down by a factor of 2,500 and added to the normal women's distribution it gives a "fat tail" that squares with other measurements.
Playing games like this makes it easy to spot errors in your thinking or calculations. At some level this is just a game. A good thing, as you can spend a lot of time getting everything to work.
This is modified from an old recipe from a cooking club I was part of.
Indian Cauliflower with Several Spices
° 1 large head cauliflower, trimmed and separated into florets
° 1 large yellow onion, thinly sliced
° 1 tbsp olive oil (you really should use coconut oil, but I didn't have any)
° 1 tsp mustard seeds
° 1 tsp cumin seeds
° 1 tsp nigella seeds (I didn't have any, so I skipped them)
° 1 tsp fenugreek seeds
° 1 tsp fennel seeds
° 6 ounces tofu, stirred 'til smooth
° 1 tsp paprika
° ½ tsp cayenne
° ½ tsp turmeric
° juice of a half lemon
° kosher salt to taste
° heat the oil in a pan and add the seeds. When they dance and begin to darken, add the onions.
° sauté the onions until translucent.
° add the cayenne, paprika and salt.
° add the cauliflower florets and mix well.
° add the tofu, mix. Cover the pan and let the cauliflower cook until somewhat tender. Stir every few minutes.
° add salt and lemon juice and mix.
° serve warm