A splash of cold water: not all data is big data! Don't catch the fever!

Big data! If you don't have it, you'd better at least fake a chart — after all, your competitors are doing it. As the story goes: if your data is too small, your competitors will beat you. It's like getting slapped in the face and then having your girlfriend stolen.

The narrative above captures the recent big-data craze. But the conventional arguments behind it hide a lot of problems. Big data is popular mainly because consulting and IT companies want to drum up business by hyping a new concept. Fortunately, some honest big-data practitioners (also known as data scientists) are skeptical of this culture. They have given us a series of reasons to resist the feverish tone:

Even giants like Facebook and Yahoo usually don't have big data; Google-scale jobs are not right for every company

Facebook and Yahoo are Internet giants, each running powerful server clusters to process their data. Needing a server cluster to process your data is often seen as the mark of "big data" — after all, if the data can be processed on a home computer, it isn't big enough. Splitting a problem into pieces and running it across a large array of machines is the classic hallmark of big data, just as Google needs a huge cluster to compute the rankings of every website on Earth.

But for many of Facebook's and Yahoo's everyday tasks, those clusters turn out to be unnecessary. At Facebook, for example, most engineers use the cluster for jobs at the megabyte level — data that a personal computer, even a laptop, could handle with ease.

It's the same at Yahoo: the median size of the jobs Yahoo sends to its clusters is about 12.5 GB — more than the average PC can cope with, but well within the capacity of a single server.

These revelations come from a paper Microsoft Research published online, pointedly titled "Nobody Ever Got Fired for Buying a Cluster." The paper shows that even at the companies that need data-processing power the most, engineers usually don't need large server clusters to solve their problems. See the issue? These companies' cluster capacity tends to go to waste — much of it is little more than decoration.

How did "big data" become a synonym for "data analysis"? The confusion has made a mess

"Data analysis" is a very old concept — the ancient Egyptian pharaohs were doing data analysis when they took stock of the state treasury. But nowadays it seems nobody dares to say "data" without putting "big" in front of it. Work that is plainly just "data analysis" gets dressed up as big data, and there are even articles along the lines of "Bring Big Data to Your Small Business." Yet the data processing they describe could be done in Google Docs, never mind Excel!

Let's be clear: in fact, most companies' data processing is quite low-end — as Rufus Pollock of the Open Knowledge Foundation put it, it's all just "small data."

Blindly chasing big data wastes money for little gain

Is more data really always better? Of course not. In fact, if you just want to analyze a correlation, you only need enough information to establish the relationship between X and Y. Collecting more is useless — it can even be counterproductive.

Michael Wu, data analyst at the respected analytics company Lithium, wrote: "Beyond a certain amount of data, the efficiency of extracting information from big data gets lower and lower." If you don't usually follow big data closely, here is a translation of that sentence: once a dataset grows past a critical point, adding more data stops being cost-effective — it's just a waste of time.
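The diminishing returns Wu describes can be illustrated with a basic statistical fact: the precision of an estimate from a sample improves only with the square root of the sample size. The sketch below (my own illustration, not from the article; the sample sizes and trial count are arbitrary) empirically measures how the standard error of a sample mean shrinks — 100x more data buys only about 10x more precision.

```python
import math
import random

random.seed(42)

def stderr_of_mean(n, trials=200):
    """Empirically estimate the standard error of the mean of n uniform draws
    by repeating the experiment `trials` times and measuring the spread."""
    means = []
    for _ in range(trials):
        sample = [random.random() for _ in range(n)]
        means.append(sum(sample) / n)
    mu = sum(means) / trials
    return math.sqrt(sum((m - mu) ** 2 for m in means) / trials)

# 100x more data, but precision improves only ~10x (sqrt(100))
for n in (100, 10_000):
    print(f"n = {n:>6}: standard error ~ {stderr_of_mean(n):.4f}")
```

The cost of collecting and processing data grows roughly linearly with its size, while the payoff grows like its square root — which is exactly why, past a certain point, more data is "just pure time-consuming."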

One reason for this: when you go looking for correlations, the bigger the dataset, the more wrong answers you will find. As data scientist Vincent Granville wrote in "The Curse of Big Data": "This is not hard to explain. Even in a dataset with only 1,000 factors, the number of relationships between those factors runs into the millions. That means some of those relationships are bound to be completely random — build a prediction model on them and you will lose miserably."
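Granville's point is easy to demonstrate: feed pure noise into a correlation search and "significant" relationships appear anyway. The sketch below (my own illustration; the factor count is scaled down from the article's 1,000 to keep it fast, and the 0.35 threshold is an arbitrary choice) generates random data with no real structure, then counts how many factor pairs nonetheless look correlated.

```python
import itertools
import math
import random

random.seed(0)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

# Pure noise: 100 "factors" measured over 50 observations, no real relationships.
n_factors, n_rows = 100, 50
data = [[random.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_factors)]

# The number of pairs grows quadratically with the number of factors...
pairs = list(itertools.combinations(range(n_factors), 2))

# ...so even a small per-pair chance of a spurious match yields many of them.
strong = sum(1 for i, j in pairs if abs(pearson(data[i], data[j])) > 0.35)
print(f"{len(pairs)} pairs examined; {strong} look 'correlated' (|r| > 0.35) by pure chance")
```

With 100 factors there are already 4,950 pairs to test; at the article's 1,000 factors there are nearly 500,000, so a model that trusts the strongest correlations it finds is largely fitting noise.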

Sometimes, big data is a road of no return…

Once a company starts working with data, it sets out on a difficult path: it must come to grips with some genuinely hard academic concepts — statistics, data quality, and everything else that falls under "data science." And like every discipline, data science is full of theories and methods that can't be verified, whose truth is unknown, and which are hard to evaluate. The water is deep!

Bias in data collection, missing context, faults in the collected data, flawed computational methods — any of these errors can steer a big-data analysis in the wrong direction, and even the best data researchers can't please everyone. Kate Crawford, visiting scholar at the MIT Media Lab, says: "We may be indulging in wishful thinking about algorithms." In other words, if you've got big data, don't assume that any old hand in your IT department can handle it for you. You may need a PhD-trained expert, or someone with equivalent experience. And even if you find that person, after doing the analysis he may well advise you that, in fact, you don't need any "big data" at all…

So which is better, big data or small data?

Does your business need data? Of course. But as the comic shows, a boss who blindly chases big data is a fool. Ever since the discipline was founded, data science has wrestled with the same questions — data quality, overall objectives, the context and correlations of the data — and an enterprise using data to make decisions must wrestle with them too. Remember: Mendel unlocked the secrets of heredity with nothing more than a notebook full of data. What matters is collecting the right data, not collecting as much data as possible.

Sources: — Comic author: —