Is Big Data a Problem or a Solution?
I am a little ashamed to confess it, but I have a problem with Big Data. It is not a technical problem (although it kind of is a technical problem too). My problem is not with the techniques, nor with the size and complexity of the data: it is a semantic problem with the expression "Big Data".
The expression "Big Data" — from now on without the quotes as it is tedious to quote it every time — is a buzz-work invented John Mashey, who was the chief scientist at Silicon Graphics in the 1990s, as traced back by Steve Lohr of The New York Times and it has been used to summarize a full range of problems associated with large data amounts and that the super-computers made by Silicon Graphics — beautiful machines indeed — would have helped solving. It was catchy and not well defined: all things that a sales department can dream for.
So, it seems that at the beginning Big Data really meant dealing with amounts of data so vast that only a multi-million-dollar machine could handle them, but more recently Big Data seems to have taken on other meanings, such as dealing with a vast amount of unstructured data and giving it sense through the application of some specific techniques. The unstructured part is a false problem, as data should always be prepared and normalized before use, exactly as you wash, rinse and cut vegetables before cooking them: no one wants to deal with dirty or bad data. The part about the specific techniques is the one that truly puzzles me. If Big Data is also a collection of techniques (or, if you prefer, ways of approaching the problem), it should lie in the solution space and not in the problem space. What I mean is that if its definition, or common meaning, also includes the knowledge of how to apply solutions to it, then it should not belong to the space of the problem.
For me, Big Data is part of the solution space, not of the problem space. The initial size of the data is not a concern (aside from the costs, which should not be that big of a problem either: if a problem is important enough, it deserves the investment to solve it), nor is its initial complexity, such as missing structure. The central idea is that data can be grown, made even larger, in order to solve the problem.
What I mean by growing the data is really augmenting the data that you have with even more data, either insourced or outsourced, to complete it and to create connections that perhaps were not there before, or that were well hidden.
Starting from the outsourced part, which is the more difficult to explain, I mean the possibility of obtaining other pieces of data that complement the information you have, adding more facets to it. Think about the way search engines provide related entities for a given query, say when users are looking for an actor (you can experience this yourself by typing "George Clooney" in the search bar of your favourite search engine and looking at sections of the page such as "People also search for"). All of those pieces of information could, at least initially, be inferred by looking at what each user searched for in sequence or, as it is usually called, in a "session": if I look for George Clooney and then for Brad Pitt, I can say that there must be some kind of relationship between these two people. But here comes the first problem: you need to know that those two queries are about "people" and, probably, you also need to know that they are actors. To do this, the canonical solution is to augment the data (in this case, the query log) with other pieces of information, such as people databases, movie databases, etc., and to be able to link them. For example: George Clooney is an actor who starred in the movie Ocean's Twelve; Brad Pitt is an actor as well, and he starred in Ocean's Twelve too → George Clooney and Brad Pitt are both actors, related through their acting in the movie Ocean's Twelve → for the query George Clooney, both Brad Pitt and Ocean's Twelve make good suggestions, as the two men do the same job and they participated in the same movie. The fact that the query George Clooney is about an actor, and that actors have roles in movies, is a piece of information that is not in the query log.
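The linking step above can be sketched in a few lines. This is a toy illustration, not how any real search engine works: the session data and the tiny "knowledge base" are made up, and the relatedness rule (co-occurrence in a session plus a shared movie) is a deliberate simplification of the reasoning in the paragraph.

```python
# Toy sketch: augmenting a query log with an external entity database
# so that session co-occurrences can be explained and generalized.
# All data here is invented for illustration.

# A session is the ordered list of queries issued by one user.
sessions = [
    ["George Clooney", "Brad Pitt"],
    ["George Clooney", "Ocean's Twelve"],
]

# The "outsourced" data: facts about entities that are NOT in the
# query log itself (who is an actor, which movies they starred in).
entities = {
    "George Clooney": {"type": "actor", "movies": {"Ocean's Twelve"}},
    "Brad Pitt": {"type": "actor", "movies": {"Ocean's Twelve"}},
    "Ocean's Twelve": {"type": "movie", "movies": set()},
}

def related_actors(a, b):
    """Two queries are related (under this toy rule) if they co-occur
    in some session AND the knowledge base links them as two actors
    who share at least one movie."""
    co_occur = any(a in s and b in s for s in sessions)
    ea, eb = entities.get(a), entities.get(b)
    linked = bool(
        ea and eb
        and ea["type"] == eb["type"] == "actor"
        and ea["movies"] & eb["movies"]
    )
    return co_occur and linked

print(related_actors("George Clooney", "Brad Pitt"))  # True: shared movie
```

Without the `entities` table, the log alone could only say "these queries appear together"; the external data is what turns that into "these are two actors who worked on the same film".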
By insourced, I mean the transformation of the data itself, adding and extracting more knowledge from the dataset than seems to be there at first glance. Let's go for an example, as it makes the explanation easier. Suppose we have available some data that describes houses through a few characteristics, together with their prices, and that we are interested in understanding which of those characteristics influence the price of each house, and by how much. Let's also suppose that, while we do not have much data about the houses in our dataset, we at least have two pieces of information: ① the size of the house and ② its proximity to a station. Say we start by discovering, without much surprise, that the size of a house has an almost direct relationship with the price: the larger the house, the higher the price. Good, nothing new under the sun. On the other hand, we could notice something interesting about the relationship between the final price of a house and its distance from a station: houses that are too far from or too close to a station have lower prices than houses that are neither too far nor too close. How so? Here, the work of a data scientist starts to be closer to that of a detective from CSI than to that of a pure mathematician or computer scientist. The fact, as some thought and discussion could reveal, is that being too far obviously makes the house less practical for commuters, since the time to reach the station can be long and it may cost extra money, for instance to buy bus tickets from the house to the station; at the same time, being too close can bring problems with the noise produced by trains at night, stations can sometimes be dodgy places, and so on, and all of this can influence the price of the house.
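A quick way to spot such a non-monotonic pattern is to bin the distances and look at the average price per bin. The numbers and the bin edges below are entirely made up to reproduce the shape described in the text: low prices near the station, a peak in the middle, low prices far away.

```python
# Toy data: (distance_to_station_km, price_in_thousands).
# Invented values, chosen only to exhibit the pattern from the text.
houses = [
    (0.1, 200), (0.2, 210), (0.4, 240),  # very close: noise, dodgy area
    (0.8, 320), (1.0, 330), (1.5, 310),  # the sweet spot
    (3.0, 250), (5.0, 220), (8.0, 180),  # too far for commuters
]

def mean_price_by_bin(data, edges):
    """Average price of the houses whose distance falls in each
    half-open interval [lo, hi) defined by consecutive edges."""
    means = []
    for lo, hi in zip(edges, edges[1:]):
        prices = [p for d, p in data if lo <= d < hi]
        means.append(sum(prices) / len(prices) if prices else None)
    return means

print(mean_price_by_bin(houses, [0.0, 0.5, 2.0, 10.0]))
# The middle bin comes out highest: prices rise, then fall, with distance.
```

A simple linear correlation would average these two opposite effects away and report almost nothing; the binned view makes the "not too close, not too far" hump visible.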
To take these things into account, we could analyse the variation of the price with respect to the distance, find the thresholds that determine when something is too close or too far, and add them back to the dataset in order to have more refined information about the data points ♠. This, it goes without saying, actually increases the data with knowledge: either in space, as the new data will need to be stored somewhere, or in processing time, if we prefer to save space and compute the new characteristic every time it is needed (which is usually not a terribly good idea, as disks are very cheap nowadays).
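Adding the thresholds back to the dataset could look like the following sketch. The threshold values and column names are hypothetical; in practice they would come out of the kind of analysis just described, and there are many ways to pick them.

```python
# Hypothetical thresholds, assumed to have been found by analysing
# the price-vs-distance curve (the values are illustrative only).
TOO_CLOSE_KM = 0.5  # closer: train noise at night, dodgy surroundings
TOO_FAR_KM = 2.0    # farther: long, costly commute to the station

def distance_band(km):
    """Derived categorical feature for a house's distance to a station."""
    if km < TOO_CLOSE_KM:
        return "too_close"
    if km > TOO_FAR_KM:
        return "too_far"
    return "sweet_spot"

# Augmenting each data point with the derived column: the dataset
# literally grows, in exchange for a more informative representation.
rows = [
    {"size_sqm": 80, "station_km": 0.3},
    {"size_sqm": 95, "station_km": 1.2},
    {"size_sqm": 120, "station_km": 6.0},
]
for row in rows:
    row["station_band"] = distance_band(row["station_km"])

print([row["station_band"] for row in rows])
# → ['too_close', 'sweet_spot', 'too_far']
```

A downstream model can now use `station_band` directly, instead of having to rediscover the non-monotonic effect of the raw distance on its own.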
So, from a certain point of view (or at least mine), Big Data is more about making data big than about dealing with data that was big to start with. And this implies that it belongs to the solution space rather than to the problem space, where most people tend to place it. I rest my case.
♠ Bear with me: this is a toy example, and many more things could be done and experimented with here. Thresholding is one of them and, for instance, there is quite a large number of different ways to do it. Moreover, other important characteristics are missing from the example, such as the quality of the area; the availability of schools, restaurants and supermarkets; whether the house has a garden; and so on.