Big Data – The monster re-defined
Every time Leo Messi scores at Camp Nou in Barcelona, almost one hundred thousand Barca fans cheer in support of their most prolific striker. Social media services such as Twitter, Instagram, and Facebook are instantaneously flooded with comments, views, opinions, analyses, photographs, and videos of yet another wonder goal from the Argentinian goalscorer. One such goal, scored in the semifinal of the UEFA Champions League against Bayern Munich in May 2015, generated more than 25,000 tweets per minute in the United Kingdom alone, making it the most tweeted sports moment of 2015 in that country. A goal like this creates widespread excitement, and not only among football fans and sports journalists. It is also a powerful driver for the marketing departments of numerous sportswear stores around the globe, which try to predict, with military precision, the day-to-day in-store and online sales of Messi's shirts and other FC Barcelona-related memorabilia. At the same time, major TV stations attempt to outbid each other in order to show forthcoming Barca games and attract multi-million revenues from advertisement slots during the half-time breaks. For a number of industries, this one goal is potentially worth much more than Messi's 20 million Euro annual salary. This one moment also creates an abundance of information, which needs to be somehow collected, stored, transformed, analyzed, and redelivered in the form of yet another product, for example, sports news with a slow-motion replay of Messi's killing strike, additional shirts dispatched to sportswear stores, or a sales spreadsheet and a marketing briefing outlining Barca's TV revenue figures.
Such moments, like Messi's memorable goals against Bayern Munich, happen on a daily basis. Actually, they are probably happening right now, while you are holding this book in front of your eyes. If you want to check what currently makes the world buzz, go to the Twitter web page and click on the Moments tab to see the most trending hashtags and topics at this very moment. Each of these more or less important events generates vast amounts of data in many different formats, from social media status updates to YouTube videos and blog posts, to mention just a few. These data may also be easily linked with other sources of event-related information to create complex, unstructured deposits of data that attempt to explain one specific topic from various perspectives and using different research methods. But here is the first problem: the simplicity of data mining in the era of the World Wide Web means that we can very quickly fill up all the available storage on our hard drives, or run out of the processing power and memory resources needed to crunch the collected data. If you end up having such issues when managing your data, you are probably dealing with something that has been vaguely denoted as Big Data.
Big Data is possibly the scariest, deadliest, and most frustrating phrase that a traditionally trained statistician or researcher can ever hear. The initial problem lies in how the concept of Big Data is defined. If you ask ten randomly selected students what they understand by the term Big Data, they will probably give you ten very different answers. By default, most will immediately conclude that Big Data has something to do with the size of a data set, that is, the number of rows and columns; depending on their fields, they will use similar wording. Indeed, they will be somewhat correct, but the argument kicks off when we ask exactly when normal data becomes Big. Some (maybe psychologists?) will try to convince you that even 100 MB is quite a big file, or big enough to be scary. Some others (social scientists?) will probably say that 1 GB of data would definitely make them anxious. Trainee actuaries, on the other hand, will suggest that 5 GB would be problematic, as even Excel suddenly slows down or doesn't want to open the file. In fact, in many areas of medical science (such as human genome studies) file sizes easily exceed 100 GB each, and most industry data centers deal with data in the region of 2 TB to 10 TB at a time. Leading organizations and multi-billion dollar companies such as Google, Facebook, or YouTube manage petabytes of information on a daily basis. What, then, is the threshold to qualify data as Big?
The answer is not very straightforward, and the exact number is not set in stone. To give an approximate estimate, we first need to differentiate between simply storing the data and processing or analyzing the data. If your goal was to preserve 1,000 YouTube videos on a hard drive, it most likely wouldn't be a very demanding task. Data storage is relatively inexpensive nowadays, and new, rapidly emerging technologies bring its prices down almost as you read this book. It is amazing to think that only 20 years ago, $300 would buy you a mere 2 GB hard drive for your personal computer, but 10 years later the same amount would suffice to purchase a hard drive with 200 times the capacity. As of December 2015, a budget of $300 can easily buy you a 1 TB SATA III internal solid-state drive: a fast and reliable drive, one of the best of its type currently available to personal users. Obviously, you can go for cheaper and more traditional hard disks in order to store your 1,000 YouTube videos; there is a large selection of available products to suit every budget. It would be a slightly different story, however, if you were tasked to process all those 1,000 videos, for example by creating shorter versions of each or adding subtitles. It would be even worse if you had to analyze the actual footage of each movie and quantify, for example, how many seconds per video red-colored objects of at least 20x20 pixels in size are shown. Such tasks require not only considerable storage capacity but also, and primarily, the processing power of the computing facilities at your disposal. You could possibly still process and analyze each video, one by one, using a top-of-the-range personal computer, but 1,000 video files would definitely exceed its capabilities and most likely the limits of your patience too.
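The storage figures quoted above can be turned into a rough price per gigabyte. The following Python sketch is only a back-of-the-envelope calculation: it assumes that 1 TB is counted as 1,000 GB, derives the middle capacity from the "200 times" statement, and uses a hypothetical helper name, price_per_gb, that is introduced here purely for illustration.

```python
# Back-of-the-envelope price per gigabyte for a fixed $300 budget.
# Capacities come from the figures quoted in the text; treating 1 TB as
# 1,000 GB is an assumption made for simplicity.

BUDGET_USD = 300

capacities_gb = {
    "about 20 years before 2015": 2,
    "about 10 years before 2015": 2 * 200,  # 200 times the 2 GB drive
    "December 2015": 1000,                  # 1 TB solid-state drive
}


def price_per_gb(budget_usd, capacity_gb):
    """Return the cost of one gigabyte in US dollars."""
    return budget_usd / capacity_gb


prices = {
    period: price_per_gb(BUDGET_USD, gb)
    for period, gb in capacities_gb.items()
}

for period, price in prices.items():
    print(f"{period}: ${price:.2f} per GB")
```

The point of the calculation is simply that storage cost per gigabyte has fallen by several orders of magnitude, which is why storage alone rarely makes data Big.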
In order to speed up the processing of such tasks, you would need to quickly find some extra cash to invest in further hardware upgrades, but even this would not solve the issue. Currently, personal computers are vertically scalable only to a very limited extent. As long as your task does not involve heavy data processing and is simply restricted to file storage, an individual machine may suffice. For heavier processing, however, apart from large enough hard drives, we would need to make sure we have a sufficient amount of Random Access Memory (RAM) and fast, heavy-duty processors on compatible motherboards installed in our units. Upgrades of individual components in a single machine may be costly, short-lived due to rapidly advancing technologies, and unlikely to bring a real change to complex data crunching tasks. To say the least, this is not the most efficient or flexible approach to Big Data analytics. A couple of sentences back, I used the plural units intentionally, as we would most probably have to process the data on a cluster of machines working in parallel. Without going into details at this stage, the task would require our system to be horizontally scalable, meaning that we would be capable of easily increasing (or decreasing) the number of units (nodes) connected in our cluster as we wish. A clear advantage of horizontal scalability over vertical scalability is that we would simply be able to use as many nodes working in parallel as required by our task, and we would not be bothered too much with the individual configuration of each and every machine in our cluster.
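The benefit of horizontal scalability can be illustrated with a deliberately idealised calculation. The Python sketch below assumes that every video takes the same time to process and that there is no scheduling, network, or storage overhead, which real clusters do incur; the function name estimate_wall_time and the figure of 120 seconds per video are hypothetical and chosen only for illustration.

```python
import math


def estimate_wall_time(n_tasks, seconds_per_task, n_nodes):
    """Idealised wall-clock time when tasks are spread evenly over nodes.

    Assumes identical task durations and no coordination overhead.
    """
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    # The slowest node determines when the whole job finishes.
    tasks_on_busiest_node = math.ceil(n_tasks / n_nodes)
    return tasks_on_busiest_node * seconds_per_task


N_VIDEOS = 1000          # from the running example in the text
SECONDS_PER_VIDEO = 120  # hypothetical processing time per video

for nodes in (1, 4, 16, 64):
    minutes = estimate_wall_time(N_VIDEOS, SECONDS_PER_VIDEO, nodes) / 60
    print(f"{nodes:>2} node(s): about {minutes:.0f} minutes")
```

Under these assumptions, adding nodes shortens the job roughly in proportion to the cluster size, whereas upgrading a single machine can only shrink the per-video time by a limited amount.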
Let's go back now for a moment to our students and the question of when normal data becomes Big. Amongst the many definitions of Big Data, one is particularly neat and generally applicable to a very wide range of scenarios. "One byte more than you are comfortable with" is a well-known phrase used by Big Data conference speakers, and I can't deny that it encapsulates the meaning of Big Data very precisely, yet it is non-specific enough to leave each one of us the freedom to decide subjectively what to qualify as Big Data, and when. In fact, all our students, whether they said Big Data was as little as 100 MB or as much as 10 petabytes, were more or less correct in their responses. As long as an individual (and his/her equipment) is not comfortable with a certain size of data, we should assume that this is Big Data for them. The size of data is not, however, the only factor that makes the data Big. Although the simplified definition of Big Data presented previously explicitly refers to one byte as a measurement of size, we should spend a few sentences dissecting the second part of the statement to gain a better understanding of what Big Data actually means. Data do not just come to us and sit in a file. Nowadays, most data change, sometimes very rapidly. Near real-time analytics of Big Data currently gives huge headaches to in-house data science departments, even at large international financial institutions or energy companies. In fact, stock-market data and sensor data are good, if rather extreme, examples of high-dimensional data that are stored and analyzed at millisecond intervals. A delay of several seconds in producing analyses of near real-time information may cost investors quite substantial amounts and result in losses in their portfolio value, so the speed of processing fast-moving data is definitely a considerable issue at the moment. Moreover, data are now more complex than ever before.
Information may be scraped from websites as unstructured text, JSON, or HTML files, obtained through service APIs, and so on. Excel spreadsheets and traditional file formats such as Comma-Separated Values (CSV) or tab-delimited files that represent structured data are no longer in the majority. It is also very limiting to think of data as being only of numeric or textual types. There is an enormous variety of available formats that store, for instance, audio and visual information, graphics, sensor and signal readings, 3D rendering and imaging files, or data collected and compiled using highly specialized scientific programs or analytical software packages such as Stata or Statistical Package for the Social Sciences (SPSS), to name just a few (a large list of most available formats is accessible through Wikipedia at https://en.wikipedia.org/wiki/List_of_file_formats ).
The size of data, the speed of their inputs/outputs, and the differing formats and types of data were in fact the original three Vs: Volume, Velocity, and Variety, described as the major conditions for treating any data as Big Data in the article titled "3D Data Management: Controlling Data Volume, Velocity, and Variety", published by Doug Laney back in 2001. Laney's famous three Vs were further extended by other data scientists to include more specific and sometimes more qualitative factors such as data variability (for data with periodic peaks of data flow), complexity (for multiple sources of related data), veracity (coined by IBM and denoting trustworthiness of data consistency), or value (for example, insight and interpretation). No matter how many Vs or Cs we use to describe Big Data, it generally revolves around the limitations of the available IT infrastructure, the skills of the people dealing with large data sets, and the methods applied to collect, store, and process these data. As we have previously concluded that Big Data may be defined differently by different entities (for example individual users, academic departments, governments, large financial companies, or technology leaders), we can now rephrase the previously referenced definition in the following general statement:
Big Data is any data that cause significant processing, management, analytical, and interpretational problems.
Also, for the purpose of this book, we will assume that such problematic data will generally start from around 4 GB to 8 GB in size, the standard capacity of RAM installed in most commercial personal computers available to individual users in the years 2014 and 2015. This arbitrary threshold will make more sense when we explain traditional limitations of the R language later on in this chapter, and methods of Big Data in-memory processing across several chapters in this book.
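The working threshold described above can be expressed as a simple rule of thumb. The Python sketch below is an illustration only, not a definitive test: it assumes that data become problematic as soon as they no longer fit into RAM, measures sizes in binary gigabytes (GiB) for convenience, and uses a hypothetical function name, is_big_for_machine. The example file sizes echo the figures mentioned earlier in the text.

```python
GIB = 1024 ** 3  # bytes in one binary gigabyte (assumed unit)


def is_big_for_machine(data_bytes, ram_bytes=4 * GIB):
    """Return True when the data would not fit into the given RAM.

    The 4 GiB default mirrors the lower end of the threshold assumed
    in the text; it is a rule of thumb, not a hard limit.
    """
    return data_bytes > ram_bytes


examples = {
    "100 MB file": 100 * 1024 ** 2,
    "5 GB file": 5 * GIB,
    "100 GB genome file": 100 * GIB,
}

for name, size in examples.items():
    for ram_gib in (4, 8):
        big = is_big_for_machine(size, ram_gib * GIB)
        verdict = "Big" if big else "manageable"
        print(f"{name} on a {ram_gib} GiB machine: {verdict}")
```

Note that the same 5 GB file is classified differently on the two machines, which reflects the idea that Big Data is defined relative to the equipment at hand; a file exactly one byte larger than the available RAM already crosses the line.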