We finally had a break in the weather and I was able to spend a lot of time in the woods yesterday. The Sun angle was noticeably higher and the snowy landscape was criss-crossed with the tracks of at least a dozen different kinds of animals. The bonus that made it more than worthwhile was seeing our local coywolf. I hadn't seen her in a month and was worried, but she looked great. Soon the sap would start flowing in the maples. The Winter has been harsh of late but there was finally a sense of a stirring. Tolkien had it right for climates like this - spring, summer, autumn, fading, winter, stirring - six seasons.1
Today is a cross-quarter day - the halfway point between the Winter Solstice and the Spring Equinox. When people were more connected with the seasons a variety of celebrations marked the day. I'm mostly of Irish descent, and in Ireland the day is Imbolc or Brighid's Day - a stirring, but often an unsettled one.
Unsettled. As I write it is still snowing after having put down eight inches. The rest of the week promises no relief. Of course the climate denier crowd is crowing about this disproving global warming, but much of the West Coast, Alaska and a large piece of Russia are amazingly warm. Weather is not climate and one must look for trends. The always marvelous xkcd said it well a few days ago.
To understand climate change you need to take a lot of measurements over a long period of time around the world and perform extremely careful (and clever!) analysis. People engaged in the research have to worry about errors at every step of the process and results are stated with conservative confidence levels. Checks come from other groups studying the same phenomena using approaches that may be similar or different, but in the end Nature has the final say. Such is science.
Scientists, particularly those in the physical sciences, are well versed in statistics and have developed many, perhaps most, of the analytical techniques. New techniques have come as the scale of the task has increased along with visualization tools to understand and communicate the results. It might be tempting to call this data science, but it isn't.
When I joined Bell Labs in the early 80s I went to more than a few seminars outside my field of study. A lot was going on, it was Bell Labs during the glory years after all, but it was particularly interesting to watch the development of new machine learning techniques for pattern recognition and classification. Some of them were far removed from statistics and others had bits and snatches. Horror of horrors, error propagation was largely ignored and even impossible in some cases. There was often a black box of sorts that required some tuning or training in order for these algorithms to work properly. Voice recognition, patterns in phone records, sound compression and many other tasks. I got over the initial ugliness and soon found myself joining in - types of cluster analysis - in a project to detect and classify defects in integrated circuit photomasks and later trying to understand the underpinning of the AT&T call recording system.
A brilliant feature of Research at AT&T was a focus on tool making. John Chambers' department came up with a statistics language called S that later became New S and was integrated with Unix. It bristled with statistical techniques along with the emerging non-statistical techniques that were the domain of computer science and became a playground for exploratory and formal analysis. In the end it became the basis for R - a great open source tool from outside the Labs.
R revolutionized the sport and came at the right time, as large amounts of Internet data were becoming available inside and outside corporate firewalls. One of its beauties was that people began to attach the R analysis source to the results in their papers, and a new generation came up to speed.
Now the term data science is popular. I'm not keen on the term as it certainly isn't a science. For me it represents an emerging category of analysts who do not specialize in the statistical analysis of the physical sciences or the black box classifications from computer science, but rather have some competence in both areas.
A few weeks ago I was asked to give a talk to students taking an intro to data science course at a local university. I'm afraid I tend not to make view graphs and winged it at the white board. Sorry for the terseness but in the interest of time I'll attach part of a letter I sent to a friend with a mutual interest in the subject. This is properly a very lengthy subject.
My goal was to give a flavor of what I think data scientist means these days and its historical roots as well as a few examples of approaches. I didn’t have time to get to some fun ones.
- Data science is a hybrid of statistics and CS.
- terminology differences can be very confusing .. “model” as an example
- the statistical approach is for people who work in uncertainty. confidence levels and extreme confidence in data, metadata and analytical techniques are paramount. Science wouldn’t work without this approach and it is my bias, but I sometimes walk on the other side
- CS folks are into things like classification and machine learning. Confidence levels are often undefined and sometimes data quality isn’t optimal (but that can get you into trouble). Yahoo, Google, Facebook, Twitter, etc all lean in this direction. It is the basis for many production systems at scale.
- in theory a data scientist has an understanding of both sides, but in practice most are on the ML/CS side these days .. at least in the business world
- there is a lot of hype that comes from misunderstanding. DS, like CS, isn’t a science. DS is more of a craft and usually demands a combination of skills
° no one person has all of these skills, so there is usually a team. Lone wolf DSs usually have a few extremely strong areas and consult with existing groups
° My list of necessary skills:
statistics, machine learning, math (at an engineering level - not math major math), data-viz, domain expertise … also there should be communication and people skills
domain expertise is critical as we will see, also communication/people skills: these two have caused the majority of failures I’ve seen
- the importance of data quality, I have never seen a data set that didn’t require munging. Simple things like time stamps can be very difficult
- scaling can be and is a huge issue. Hadoop, etc are useful but way overhyped. We are at fundamental limits with the statistical approach, but confidence in science driven requirements will force invention. It always has. (the synoptic telescope for example)
- some notes on data - there is bias in the data and the analysis, quantify it if you can (statistical approach), learn to live with it. Complete data sets aren’t complete!!! not even close!
- EDA: exploratory data analysis. the S language and now R and extensions to python.
- a few examples of common algorithms (you have to know a lot and have the taste to know when/where to use them!):
linear regression (long discussion.. lots of blank stares as if they’d never seen it) core to the statistical/scientific approach
k-NN (k nearest neighbors) - simple machine learning classification. The concept of distance. Why it is easier than statistical techniques for classification. The art of training and tuning. The need to get good enough
k-means - unsupervised machine learning … convergence and interpretation can be major issues
° underscore that ML is a black box with tuning approach whereas statistical approaches demand knowing what is going on during the analysis. They serve very different needs.
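To give a flavor of the k-NN item above, here is a minimal sketch in plain Python. It is not what I drew on the white board: the Euclidean distance, the majority vote, the function names and the toy two-feature data set are all assumptions made up for illustration. Note what is missing - there is no confidence level attached to the answer, and the only knob is k.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs. Both the distance
    function and the value of k are tuning choices, not givens.
    """
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two made-up features per sample, two classes.
train = [
    ((1.0, 1.0), "ok"),
    ((1.2, 0.8), "ok"),
    ((0.9, 1.1), "ok"),
    ((4.0, 4.2), "defect"),
    ((4.1, 3.9), "defect"),
    ((3.8, 4.0), "defect"),
]

print(knn_predict(train, (1.1, 0.9)))  # prints "ok"
print(knn_predict(train, (3.9, 4.1)))  # prints "defect"
```

On real data the features usually need rescaling before distance means anything, and k is chosen by trying values against held-out samples - that is the training and tuning art mentioned above.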
____
that’s as far as I got and I took up well over an hour. I really wanted to talk about Bayes' theorem as so many ML guys worship it (it is useful for certain classes of problems and inappropriate for many others)
I've glossed over a lot of important points and I'll be happy to go into detail. I should stress data-viz can be very dangerous as it is very easy to paint a seductive story. I've seen terrible failures in organizations without domain expertise built into the analysis group and in a lack of communication between the analysts and their customers. There is a real danger of using prepackaged easy-to-use systems that give beautiful results. While there is a lot of power under the hood you can end up in a ditch if you lack the experience.
The CS machine learning techniques usually don't have statistical underpinnings, but are good enough for quick production work and can scale (although there are some great research issues on going further). The statistical approach used in science has some very serious scaling issues that need to be solved. I expect to see solutions emerge from the huge scientific data hoses like the Large Synoptic Survey Telescope and the mind wobbling Square Kilometer Array.2
In short data science is a bridge - an isthmus - between the large continents of physical science statistics and more black boxish techniques from computer science. It is still emerging and represents a combination of skills that will be found in no one person. There is a lot of art to it and, at least in its business focus, it tends to work on problems where careful attention to confidence levels is less important than getting a good enough result and evolving along with the problems at hand. But in the end there is a similar requirement to that of the physical sciences - you do need to understand your datasets and you need to ask the right questions.
__________
1 In The Lord of the Rings the Elven calendar had six seasons. There is a theme of sixes and twelves among the elves, but six seasons seems appropriate for the climate in the Northeastern US. I grew up in Montana where three might be more appropriate.
2 Very important tools that will drive analysis infrastructure and invention. They deserve another post.
__________
Recipe Corner
I've been on a sweet potato kick lately. This is a modification of an old recipe I've had for a long time.
Sweet Potato Stew
Ingredients
° 1 tbsp olive oil
° 1 medium yellow onion
° 2 cloves of garlic, minced
° 1 tsp grated ginger (note - peeling ginger with the edge of a spoon is so much easier than using a knife!)
° 1/2 tsp cumin
° 1/2 tsp cinnamon
° 1/2 tsp red pepper flakes (or some other way to simulate a small bit of heat)
° 2 medium sweet potatoes, peeled and cut into inchish chunks
° 1 cup vegetable stock
° a half can of San Marzano tomatoes crushed (about 14 oz)
° 1 19 oz can kidney beans
° 1/4 cup smooth peanut butter
Technique
° heat oil in a large pan over medium heat. Sauté onion 'til soft (don't brown)
° add ginger, garlic and spices and sauté 'til you can smell them .. less than a minute!
° add veggie stock and tomatoes and sweet potatoes. Cover, cut heat to medium-low and simmer for about a half an hour 'til sweet potatoes are soft
° add kidney beans and heat. Finally stir in the peanut butter and serve.