Big Data does not necessarily mean Good Data. And that, a growing number of experts are saying with increasing insistence, means Big Data does not automatically yield good analytics.
If the data is incomplete, out of context or otherwise contaminated, it can lead to decisions that could undermine the competitiveness of an enterprise or damage the personal lives of individuals.
One of the classic stories of how data out of context can lead to distorted conclusions comes from Harvard University professor Gary King, director of the Institute for Quantitative Social Science. A Big Data project was attempting to use Twitter feeds and other social media posts to predict the U.S. unemployment rate, by monitoring key words like "jobs," "unemployment," and "classifieds."
Using an analytics technique called sentiment analysis, the group collected tweets and other social media posts that included these words to see if there were correlations between an increase or decrease in them and the monthly unemployment rate.
While monitoring them, the researchers noticed a huge spike in the number of tweets containing one of those key words. But, as King noted, they later discovered it had nothing to do with unemployment. "What they hadn't noticed was Steve Jobs died," he said.
In the telling, it's a somewhat humorous story, outside of the tragedy of Jobs' untimely passing. But the lesson is a deadly serious one for those looking to rely on the magic of Big Data to guide their decisions.
King said the mix-up over the dual meanings of "jobs" is "just one of many similar anecdotes. Anyone working in this area has had similar experiences."
"Lists of keywords, curated by human beings, work OK for the short run, but tend to fail catastrophically over the long run," he said. "You can fix it up by adding exceptions, but there's a lot of human labor involved."
He said it is easy for anyone to create their own example just by entering a keyword into the Bing Social page.
"You'll see some relevant things and some irrelevant. If you don't change the query and watch over time, you will often find the conversation veering away in some way — sometimes a little, sometimes not at all for a while, and sometimes dramatically," he said.
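The keyword-list approach King describes, and the hand-added exceptions needed to patch it, can be illustrated with a minimal Python sketch. This is not the researchers' actual method: the sample posts, the `EXCEPTIONS` list and the function names are all hypothetical, chosen only to show how a naive keyword count is inflated by an unrelated use of "jobs" and how a human-curated exception corrects it.

```python
import re

# Keywords named in the article as signals for unemployment chatter.
KEYWORDS = {"jobs", "unemployment", "classifieds"}

# Hypothetical hand-curated exceptions: phrases that contain a keyword
# but have nothing to do with employment.
EXCEPTIONS = [re.compile(r"\bsteve jobs\b", re.IGNORECASE)]


def mentions_keyword(post: str) -> bool:
    """Naive match: True if any keyword appears as a word in the post."""
    words = set(re.findall(r"[a-z']+", post.lower()))
    return bool(words & KEYWORDS)


def is_relevant(post: str) -> bool:
    """Keyword match, minus posts that hit a known exception phrase."""
    if any(pattern.search(post) for pattern in EXCEPTIONS):
        return False
    return mentions_keyword(post)


# Made-up sample posts, for illustration only.
posts = [
    "Any jobs open in Ohio? Checking the classifieds again",
    "RIP Steve Jobs, you changed everything",
    "Unemployment office line was two hours today",
    "Steve Jobs biography is out",
]

naive_count = sum(mentions_keyword(p) for p in posts)
filtered_count = sum(is_relevant(p) for p in posts)

print("naive keyword count:", naive_count)
print("count after exceptions:", filtered_count)
```

In this toy sample, half of the naive matches are about Steve Jobs rather than employment. The exception list fixes that one case, but every new ambiguity needs another manual entry, which is the human labor King refers to.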
But King said that overall there are many examples of big data analytics producing useful things, "so failures tend not to appear in the literature."
Kim Jones, senior vice president and CSO of Vantiv, said this is not a new problem, but one that can be magnified if people think massive amounts of data are magically going to produce good analytics.
"The Jobs example was a classic case of data without context. Data by itself doesn't equal intelligence," he said.
King agrees that context is key. He is co-founder and chief scientist of Crimson Hexagon, a big-data analytics firm that, in the words of Wayne St. Amand, its executive vice president of marketing, seeks to provide, "context, meaning and structure to online conversations."
Yet there are increasing examples of data without context driving decisions. The Wall Street Journal reported in February on health insurance companies using Big Data to create profiles of their members. Among the things the companies tracked was a history of buying plus-sized clothes, which could lead to a mandatory referral to weight-loss programs.
Few people would argue with encouraging people to live healthier lives, but the privacy implications are disturbing. It is possible the person buying those clothes might have been doing so for another family member. And it is not always so benign. Bloomberg BusinessWeek reported in 2008 on individuals being denied health insurance based on a history of prescription drug purchases that suggested even minor mental health conditions.
Adam Frank, writing on the National Public Radio blog, noted that in some cases banks will deny a loan to someone based in part on their contacts on the employment networking site LinkedIn or the social networking behemoth Facebook. If your "friends" are deadbeats, your credit-worthiness may be based on their reliability.
Frank quoted Jay Stanley, senior policy analyst at the ACLU, noting on that group's blog that, "Credit card companies sometimes lower a customer's credit limit based on the repayment history of the other customers of stores where a person shops. Such 'behavioral scoring' is a form of economic guilt-by-association based on making statistical inferences about a person that go far beyond anything that person can control or be aware of."
Kim Jones said the tendency to jump to a conclusion from correlations without further analysis could have affected him personally. "During the late '80s and early '90s, data showed that Hispanic and Black males between the ages of 20 and 27 who were driving an entry-level luxury car on the I-95 corridor were likely to be drug runners," he said.
"I fit some of that profile — I'm African American, I was that age and at that time I was driving a car like that. But if I had been stopped, the police would have seen that I was wearing an Army uniform with Second Lt. bars and had a West Point ring," he said.
The point, he said, is that, "it's always bad to rely just on data analytics. When you take the human element out of the equation, you by definition create a higher error rate."
In short, Big Data is a tool, but should not be considered the solution. "It can help you narrow something down from millions to perhaps 150," Jones said, "but the temptation is to let the computer do it all, and that is what is going to get you in trouble."