The other day someone asked my opinion on Monsanto getting into agricultural "big data". People don't seem to agree on what that means, so I'll offer my definition, but first I should note that "data" means something different to someone in the physical sciences than it does in common usage.1
"Big" is also loaded. Big compared to what? I've seen datasets for some Fortune 50 companies that are trivially small compared to a week's worth of post-trigger data from the LHC, but they are still sufficient to be overwhelming. When I was a grad student in the 1970s, my thesis experiment generated about 250 gigabytes of data and required a few years to analyze. That was enormous at the time. Our grant paid for several hundred thousand dollars worth of time on a supercomputer of the day that had less computational power than my iPhone 5.
Big data, for me, is dealing with the collection, processing and analysis of information that is at the frontier of what is possible.
The devil is in the details and lurks at every step. In the early 90s a few of us looked into the recording and billing system that AT&T used for domestic long distance calls. This involved call detail records from 4ESS and 5ESS switches made by AT&T and DMS 100 and 200 switches made by Northern Telecom. Call detail records are simple things - time stamps associated with the beginning and end of the call, some routing information, the type of call and so on. Much of this had been cobbled together over several decades and too many assumptions had been made along the way. We found error rates on the order of two percent and were able to sort out quite a bit of what was going on in the system (the main technical report ran nearly 200 pages). None of this had been properly characterized before and we undoubtedly missed bits and pieces, but we were able to characterize the system well enough to supply confidence levels on our measurements (we were a couple of physicists and a mathematician).
Model building is extremely important. You have to understand not only what you are measuring, but what you think it means. This can be extremely difficult in the social world and some huge leaps in logic are frequently made. I've seen very robust work as well as work that is completely off base. Sometimes there is enough information in your datasets to work with and sometimes other techniques are much more powerful. Part of the success of Trader Joe's comes from being able to distill customer feedback that comes from their workers and to make good use of the information. Big data analysis is not a big part of their strategy - it doesn't have to be, as they are smart enough to make use of other more appropriate tools.
There is a danger associated with the misapplication of current and emerging tools. Too often the underlying mechanisms are poorly understood, and sometimes it is difficult or impossible to know what the tools are actually doing. The user may be well versed in their field but may not understand the problem at hand, the details of the models being used, the quality of the data, or several other critical components. Compounding the problem, many of the newer tools have very powerful visualization components. People tend to be more accepting and less skeptical of beautiful visual design. ... lipstick for the pig
Too many organizations are jumping on the bandwagon simply because they think there is gold in their haystacks and, after all, isn't Hadoop cool?
Some have a solid understanding of their domain and can produce results that can be characterized. I'm guessing there will be something of a bubble with a lot of less serious organizations getting burned in the next few years and a smaller number finding it is worth the serious effort to get things right.
Beyond that there are a few interesting things to note. Particle physics is sometimes called the elder statesman of big data ... a lot of serious work understanding data collection, trigger analysis, modeling and analysis with results being compared with the work of others who were using their own techniques. The complexity of the debris from a collision event increased with the collision energy of the accelerators and the resulting computational challenges have been matched (so far) by a combination of Moore's Law and rapidly advancing algorithmic approaches.2
The Large Hadron Collider only keeps a tiny fraction of the data, pre-selected by hardware and software based models called triggers that determine what is important. About three petabytes of data a month were kept, going out to a network of several hundred thousand computers in three dozen countries. The LHC is being modified for higher energies, as is the computational environment. When it turns back on, the data rates will increase significantly. But even though this is a lot of data, it is relatively simple. The events are isolated - there is no correlation between events. Event correlation is a major challenge in astronomical surveys - the Square Kilometer Array, the Sloan Digital Sky Survey, the Dark Energy Survey and the Large Synoptic Survey Telescope come to mind.
The LSST records about 500 attributes of 20 or so billion objects in the sky. A time series of these objects is recorded. Attributes of these objects can change with time and there is a good deal of complexity associated with the early stages of capture and categorization - the transparency of the sky from night to night can dramatically change what something looks like. Finding unexpected correlations in 500 dimensional spaces is at the edge of and beyond what we can do at the moment ... by my definition this is big data.3
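To get a feel for the scale of that search, consider just the counting. This is a hypothetical back-of-the-envelope sketch - the 500-attribute figure comes from the text above; the function name and everything else is my own illustration:

```python
from math import comb

def correlation_subsets(n_attributes: int, order: int = 2) -> int:
    """Number of distinct attribute subsets of a given size that a
    brute-force correlation search would have to examine."""
    return comb(n_attributes, order)

# Pairwise correlations among 500 attributes
print(correlation_subsets(500, 2))   # 124,750 pairs
# Three-way interactions blow up much faster
print(correlation_subsets(500, 3))   # 20,708,500 triples
```

Pairs alone are manageable; it's the higher-order interactions, multiplied across tens of billions of objects and their time series, that push the problem to the frontier.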
A lot of research is going into exotic areas like compressed sensing, the maximal information coefficient, and topological analysis, and people are beginning to make progress. Some of it will be useful in other domains.
As a grad student I spent time with a Swede and a Californian hacking a fax machine to increase its resolution by a factor of about four. We needed to communicate the sub and superscripts in our equations more clearly. At the same time people were using the emerging computer internetworks to move computer typeset documents. This became very common as the 80s went on, but order needed to be made of it. There was a guy known as TimBL at CERN with an interesting proposal that he later implemented in the form of an HTTP client and server. All to support the need for particle physics documents - usually time sensitive preprints - to be accessible by collaborations of groups around the world. Sir Timothy Berners-Lee OM, KBE, FRS, FREng, FRSA had invented the web.
Mark Twain supposedly said something like "history doesn't repeat itself, but it does rhyme..." I suspect folks doing the pure research in particle physics and astrophysics will create tools that will have an enormous impact on society.
1 Not to go into detail, but information is the more basic term in physics. Data generally has meta-information attached to it - how it was measured and to what accuracy for example. Within the field it is assumed that someone has a good characterization of the meta-information when they invoke the term data.
It should be said that physics is deeply into detail and understanding an apparatus, models and mathematical manipulations are of fundamental importance to the sport.
2 Particle physicists learn about the very small by slamming particles into each other at nearly the speed of light. The energy density during the collision is similar to that of early periods in the formation of the universe and allows the creation of exotic forms of matter. A friend notes that you wouldn't want to have one of us study a car or an airplane.
3 The complexity is far beyond linear - it goes as 2^N, where N is the number of dimensions. This is far faster than improvements from Moore's Law. There are approaches to dealing with this, but much remains to be done.
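A toy calculation makes the footnote's point concrete. The 2^N figure can be read as the number of candidate subspaces (subsets of the dimensions) in an exhaustive search; the two-year doubling period for Moore's Law is my assumption, and the function names are my own:

```python
def subspaces(n_dims: int) -> int:
    # Every subset of the dimensions is a candidate subspace: 2^N of them.
    return 2 ** n_dims

def moore_capacity(years: float, doubling_period: float = 2.0) -> float:
    # Rough Moore's Law model: compute capacity doubles every two years.
    return 2 ** (years / doubling_period)

print(subspaces(10))       # 1024 subspaces
print(subspaces(20))       # 1048576 subspaces
print(moore_capacity(20))  # 1024.0 -- twenty years of Moore's Law
```

Each added dimension doubles the search space, which is the same factor Moore's Law needs roughly two years to deliver - so hardware alone can never keep up with rising dimensionality, and algorithmic advances have to carry the load.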
A pressure cooker recipe. If you regularly cook and don't have a pressure cooker, drop what you are doing and get one. I did quite a bit of asking around and ended up with a Kuhn Rikon Duromatic and wouldn't hesitate to recommend it. A bit spendy, but it is bulletproof and just works - an acceptable price for one of the most important pieces of kit in my kitchen.
Red Lentils and Jasmine Rice
° 2 tsp olive oil
° 1 tbsp minced ginger
° 2 cloves garlic, minced
° 2 cups jasmine rice (I used a brown jasmine)
° 1/2 c red lentils (you could use other varieties)
° 3 cups water or vegetable broth (really rich if you have a good broth around, but ok with water)
° sea salt
° heat the cooker over medium heat, add the oil. Sauté the garlic and ginger for about a minute. Stir in the rice, lentils and add the fluid.
° lock the lid and bring to pressure. Lower the heat to maintain pressure and let it cook for 18 minutes.
° move pot off the burner and let it come to atmospheric pressure naturally
° season to taste (a bit of salt for me)