Tech Guides - Big Data

Reducing Cost in Big Data using Statistics and In-memory Technology - Part 1

Praveen Rachabattuni
03 Jul 2015
4 min read
The world is shifting from private, dedicated data centers to on-demand computing in the cloud. This shift moves the onus of cost from the hands of IT companies into the hands of developers. As your data sizes start to rise, the computing cost grows linearly with them. We have found that using statistical algorithms gives us a 95 percent accuracy rate, is faster, and is a lot more beneficial than waiting for the exact results. The following are some common analytical queries that we have often come across in applications:

- How many distinct elements are in the data set (that is, what is the cardinality of the data set)?
- What are the most frequent elements (that is, the "heavy hitters" and "top elements")?
- What are the frequencies of the most frequent elements?
- Does the data set contain a particular element (a search query)?
- Can you filter the data based upon a category?

Statistical algorithms for quicker analytics

Frequently, statistical algorithms avoid storing the original data, replacing it with hashes, which eliminates a lot of network traffic. Let's get into the details of some of these algorithms, which can help answer queries similar to those mentioned previously; minimal Python sketches of these ideas appear after the case study below.

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. It is suitable in cases where we need to quickly filter items that are present in a set.

HyperLogLog is an approximate technique for computing the number of distinct entries in a set (its cardinality). It does this while using only a small amount of memory: to achieve 99 percent accuracy, it needs only 16 KB. When we need to count the distinct elements in a dataset spread across a Hadoop cluster, we can compute the hashes on different machines, build the bit indexes locally, and combine those bit indexes to compute the overall number of distinct elements. This eliminates the need to move the data across the network and thus saves us a lot of time.

The Count-min sketch is a probabilistic, sub-linear-space streaming algorithm that can be used to summarize a data stream and obtain the frequency of its elements. It allocates a fixed amount of space to store count information, which does not vary over time even as more and more counts are updated. It is nevertheless able to provide useful estimated counts, because its error bound scales with the total sum of all the counts stored.

Spark - a faster execution engine

Spark is a faster execution engine that provides up to 10 times the performance of MapReduce when combined with these statistical algorithms, giving us a huge benefit in terms of both cost and time savings. Spark gets most of its speed by constructing Directed Acyclic Graphs (DAGs) out of the job operations and by using memory to save intermediate data, thus making reads faster. When using statistical algorithms, keeping the hashes in memory makes the algorithms work much faster.

Case study

Let's say we have a continuous stream of user log data arriving at a rate of 4.4 GB per hour, and we need to analyze the distinct IPs in those logs on a daily basis. At my old company, when MapReduce was used to process the data, it took about 6 hours to process one day's worth of data, at a size of 106 GB. We had an AWS cluster consisting of 50 spot instances and 4 on-demand instances running to perform the analysis, at a cost of $150 per day. Shifting the system to Spark and HyperLogLog brought that cost down to $16.50 per day.
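To make the Bloom filter idea concrete, here is a minimal, self-contained Python sketch. This is an illustration rather than the code behind the system described above; the sizing formulas are the standard ones, and the SHA-256 double-hashing scheme is an implementation choice:

import hashlib
import math

class BloomFilter:
    """A minimal Bloom filter: k hash probes into an m-bit array."""

    def __init__(self, capacity, error_rate=0.01):
        # Standard sizing: m = -n*ln(p) / (ln 2)^2 bits, k = (m/n)*ln 2 probes.
        self.m = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k probe positions from two 64-bit base hashes.
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(capacity=1_000_000, error_rate=0.01)
bf.add("198.51.100.7")
print("198.51.100.7" in bf)   # True: added items are never missed
print("203.0.113.9" in bf)    # usually False: about 1 percent false positives

Note the asymmetry: a Bloom filter can return false positives but never false negatives, which is exactly what makes it safe as a quick pre-filter in front of slower storage.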
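In the same spirit, a toy HyperLogLog sketch, assuming the standard register formulation (the small-range and large-range bias corrections of the full algorithm are omitted for brevity, so estimates for very small sets will be off):

import hashlib

class HyperLogLog:
    """Toy HyperLogLog: 2^p registers, each storing a max leading-zero rank."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                    # 2^14 one-byte registers ~ 16 KB
        self.registers = [0] * self.m

    def add(self, item):
        x = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        j = x >> (64 - self.p)                      # first p bits pick a register
        w = x & ((1 << (64 - self.p)) - 1)          # remaining 64-p bits
        rho = (64 - self.p) - w.bit_length() + 1    # position of the leftmost 1-bit
        self.registers[j] = max(self.registers[j], rho)

    def merge(self, other):
        # Register-wise max: each machine can sketch its own shard, and only
        # these small register arrays ever cross the network.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant for large m
        return alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)

The merge method is the point of the distributed story above: combining the per-machine bit indexes is a cheap register-wise maximum, so the raw logs never have to move across the network.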
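A compact Count-min sketch along the same lines (the width and depth defaults are arbitrary illustrative choices; estimates can only ever over-count, never under-count):

import hashlib

class CountMinSketch:
    """Fixed-size frequency summary: depth rows of width counters."""

    def __init__(self, width=2048, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Every row over-counts because of collisions, so the minimum across
        # rows is the tightest available bound on the true frequency.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))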
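Finally, to tie the case study together: recent Spark releases ship a HyperLogLog-based aggregate out of the box, so a daily distinct-IP job could be sketched roughly as follows. The input path, the JSON format, and the timestamp and ip column names are assumptions for illustration, and the original 2015 system predates this exact API:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-distinct-ips").getOrCreate()

# Hypothetical log source with "timestamp" and "ip" fields.
logs = spark.read.json("s3://my-bucket/user-logs/")

daily = (
    logs.groupBy(F.to_date("timestamp").alias("day"))
        .agg(F.approx_count_distinct("ip", rsd=0.01).alias("distinct_ips"))
)
daily.show()

Because approx_count_distinct is sketch-based, each executor summarizes its own partition and only tiny sketches are shuffled, which is the same cost saving described above.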
To summarize, we ended up processing a 3.1 TB stream of data every month at a cost of $495, down from the roughly $4,500 per month that the original MapReduce system, without the statistical algorithms in place, had been costing.

Further reading

In the second part of this two-part blog series, we will discuss two tools in depth: Apache Spark and Apache Pig. We will take a look at how Pig combined with Spark makes existing ETL pipelines 100 times faster, and we will further our understanding of how these statistical techniques positively affect data analytics.

About the author

Praveen Rachabattuni is a tech lead at Sigmoid Analytics, a company that provides a real-time streaming and ETL framework on Apache Spark. Praveen is also a committer to Apache Pig.

What Did Big Data Deliver In 2014?

Akram Hussain
30 Dec 2014
5 min read
Big data has always been a hot topic, and in 2014 it came into its own. Big data has developed, evolved, and matured to give significant value to business intelligence. However, there is much more to big data than meets the eye. Understanding enormous amounts of unstructured data is not easy by any means; yet once that data is analysed and understood, organisations have started to value its importance and need. Big data has helped create a number of opportunities, ranging from new platforms, tools, and technologies, to improved economic performance in different industries, to the development of specialist skills, job creation, and business growth. Let's do a quick recap of 2014 and what big data offered the tech world, from the perspective of a tech publisher.

Data Science

The term "data science" has admittedly been around for some time, yet in 2014 it received a lot more attention thanks to the demands created by big data. Looking at data science from a tech publisher's point of view, it's a concept which has rapidly been adopted, with potential for greater levels of investment and growth. To address the needs of big data, data science has been split into four key categories: data mining, data analysis, data visualization, and machine learning. Equally, we have important topics which fit in between those, such as data cleaning (munging), which I believe takes up the majority of a data scientist's time. The number of jobs for data scientists has exploded in recent times and will continue to do so: according to global management firm McKinsey & Company, there will be a shortage of 140,000 to 190,000 data scientists due to the continued rise of big data, and the role has even been described as the "sexiest job of the 21st century."

Real-time Analytics

The competitive battle in big data throughout 2014 focused on how fast data could be streamed to achieve real-time performance. Real-time analytics' most important feature is instant access: querying data as soon as it comes through. The concept is applicable to different industries and supports the growth of new technologies and ideas. Live analytics are most valuable to social media sites and marketers, providing actionable intelligence. Likewise, real-time data is becoming increasingly important with the phenomenon known as the Internet of Things. The ability to make decisions instantly and plan outcomes in real time is more feasible now than ever before, thanks to the development of technologies like Spark and Storm, and of NoSQL databases like Apache Cassandra, which enable organisations to rapidly retrieve data with fault-tolerant performance.

Deep Learning

Machine learning (ML) became the new black and is in constant demand by many organisations, especially new startups. However, even though machine learning is gaining adoption and an improved appreciation of its value, deep learning is the concept that really pushed on in 2014. Granted, both ML and deep learning have been around for some time; we are looking at the topics in terms of current popularity levels and adoption in tech publishing. Deep learning is a subset of machine learning which refers to the use of artificial neural networks composed of many layers. The idea is based around a complex set of techniques for finding information, to generate greater accuracy of data and results. The value gained from deep learning is that the information, drawn from hierarchical data models, helps AI machines move towards greater efficiency and accuracy, learning to recognize and extract information by themselves, unsupervised! The popularity around deep learning has seen large organisations invest heavily, such as Google's acquisition of DeepMind for $400 million and Twitter's purchase of Madbits; these are just a few of the high-profile investments amongst many. Watch this space in 2015!

New Hadoop and Data Platforms

Hadoop, the technology best associated with big data, changed its batch processing approach from MapReduce to what's better known as YARN towards the end of 2013, with Hadoop v2. MapReduce demonstrated the value and benefits of large-scale, distributed processing. However, as big data demands increased and greater flexibility, multiple data models, and visual tools became requirements, Hadoop introduced YARN to address these problems. YARN stands for "Yet-Another-Resource-Negotiator." In 2014, the emergence and adoption of YARN allowed users to carry out multiple workloads, such as streaming, real-time, and generic distributed applications of any kind (YARN handles and supervises their execution!), alongside the MapReduce models. The biggest trend I've seen with the change in Hadoop in 2014 was this transition from MapReduce to YARN. The real value in big data and data platforms is the analytics, and in my opinion that will be the primary point of focus and improvement in 2015.

Rise of NoSQL

NoSQL, also interpreted as "Not Only SQL," exploded in 2014, with a wide variety of databases coming to maturity. NoSQL databases have grown in popularity thanks to big data. There are many ways to look at stored data, but it is very difficult to process, manage, store, and query huge sets of messy, complex, and unstructured data. Traditional SQL systems just wouldn't allow that, so NoSQL was created to offer a way to look at data with no restrictive schemas. The emergence of graph, document, wide-column, and key-value store databases has shown no slowdown, and the growth continues to attract a higher level of adoption. However, NoSQL seems to be taking shape and settling on a few major players, such as Neo4j, MongoDB, and Cassandra. Whatever 2015 brings, I am sure it will be faster, bigger, and better!

Big Data Is More Than Just a Buzz Word!

Akram Hussain
16 Dec 2014
4 min read
We all agree big data sounds cool (well, I think it does!), but what is it? Put simply, big data is the term used to describe massive volumes of data. We are thinking of data along the lines of terabytes, petabytes, and exabytes in size. In my opinion, that's as simple as it gets when thinking about the term "big data."

Despite this simplicity, big data has been one of the hottest, and most misunderstood, terms in the Business Intelligence industry recently; every manager, CEO, and director is demanding it. However, once the realization sets in on just how difficult big data is to implement, they may be scared off! The real reason behind the "buzz" was all the new benefits that organizations could gain from big data. Yet many overlooked the difficulties involved, such as:

- How do you get that level of data?
- If you do, what do you do with it?
- Cultural change is involved, and most decisions would be driven by data. No decision would be made without it.
- The cost and skills required to implement and benefit from big data.

The concept was misunderstood initially; organisations wanted data but failed to understand what they wanted it for and why, even though they were happy to go on the chase.

Where did the buzz start?

I truly believe Hadoop is what gave big data its fame. Initially developed at Yahoo, used in-house, and then open sourced as an Apache project, Hadoop served a true market need for large-scale storage and analytics. Hadoop is so well linked to big data that it's become natural to think of the two together. Comparing how often people searched for the two terms shows a visible correlation (if not causation). I would argue that "buzz words" in general (or trends) don't take off before the technology that allows them to exist does. If we consider buzz words like "responsive web design," it needed the correct CSS rules; "IoT" needed Arduino and Raspberry Pi; and likewise, "big data" needed Hadoop. Hadoop was on the rise before big data had taken off, which supports my theory. Platforms like Hadoop allowed businesses to collect more data than they could have conceived of a few years ago. Big data grew as a buzz word because the technology supported it.

After the data comes the analysis

However, the issue still remains of collecting data with no real purpose, which ultimately yields very little in return; in short, you need to know what you want and what your end goal is. This is something that organisations are slowly starting to realize and appreciate, represented well by Gartner's 2014 Hype Cycle. Big data is currently in the "Trough of Disillusionment," which I like to describe as "the morning after the night before." This basically means that realisation is setting in; the excitement and buzz of big data has come down to something akin to shame and regret. The true value of big data can be categorised into three sections: data types, speed, and reliability. By this we mean: the larger the data, the more difficult it becomes to manage the types of data collected, that is, it will be messy, unstructured, and complex; the speed of analytics is crucial to growth and on-demand expectations; and likewise, having a reliable infrastructure is at the core of sustainable efficiency. Big data's actual value lies in processing and analyzing complex data to help discover, identify, and make better-informed, data-driven decisions. Likewise, big data can offer clear insight into strengths, weaknesses, and areas of improvement by discovering the patterns crucial for success and growth. However, this comes at a cost, as mentioned earlier.

What does this mean for big data?

I envisage that the invisible hand of big data will be ever present. Even though devices are getting smaller, data is increasing at a rapid rate. Once the true meaning of big data is appreciated, it will genuinely turn from a buzz word into something smaller organisations might become reluctant to adopt: in order to implement big data, they will need to appreciate the need for structural change, the costs involved, the skill levels required, and an overall shift towards a data-driven culture. To gain the maximum efficiency from big data, and to appreciate that it's more than a buzz word, organizations will have to be very agile and accept the risks in order to benefit from that level of change.

Python Data Stack

Akram Hussain
31 Oct 2014
3 min read
The Python programming language has grown significantly in popularity and importance, both as a general programming language and as one of the most advanced providers of data science tools. There are six key libraries every Python analyst should be aware of:

1 - NumPy

Also known as Numerical Python, NumPy is an open source Python library used for scientific computing. NumPy gives both speed and higher productivity using arrays and matrices. This basically means it's super useful when analyzing basic mathematical data and calculations. It was one of the first libraries to push the boundaries for Python in big data. The benefit of using something like NumPy is that it takes care of all your mathematical problems with useful functions that are cleaner and faster to write than normal Python code. This is all thanks to its similarities with the C language.

2 - SciPy

Also known as Scientific Python, SciPy is built on top of NumPy and takes scientific computing to another level. It's an advanced form of NumPy that allows users to carry out tasks such as solving differential equations, evaluating special functions, optimization, and integration. SciPy can be viewed as a library that saves time through predefined complex algorithms that are fast and efficient. However, there is such a plethora of SciPy tools that they might confuse users more than help them.

3 - Pandas

Pandas is a key data manipulation and analysis library in Python. Pandas' strength lies in its rich data functions, which work amazingly well with structured data. There have been a lot of comparisons between pandas and R packages due to their similarities in data analysis, but the general consensus is that it is very easy for anyone using R to migrate to pandas, as it supposedly brings together the best features of R and Python programming in one place.

4 - Matplotlib

Matplotlib is a visualization powerhouse for Python programming, offering a large library of customizable tools to help visualize complex datasets. Providing appealing visuals is vital in the fields of research and data analysis. Python's 2D plotting library is used to produce plots and make them interactive with just a few lines of code. The plotting library additionally offers a range of graphs including histograms, bar charts, error charts, scatter plots, and much more.

5 - scikit-learn

scikit-learn is Python's most comprehensive machine learning library and is built on top of NumPy and SciPy. One of the advantages of scikit-learn is the all-in-one-resource approach it takes, containing various tools to carry out machine learning tasks such as supervised and unsupervised learning.

6 - IPython

IPython makes life easier for Python developers working with data. It's a great interactive web notebook that provides an environment for exploration with prewritten Python programs and equations. The ultimate goal behind IPython is improved efficiency thanks to high performance, allowing scientific computation and data analysis to happen concurrently using multiple third-party libraries.

Continue learning Python with a fun (and potentially lucrative!) way to use decision trees. Read on to find out more.
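To see several of these libraries working together, here is a small end-to-end sketch on synthetic data (the regression setup is purely illustrative and assumes reasonably recent versions of each library):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)                   # NumPy: fast array generation
y = 3 * x + rng.normal(0, 2, 100)             # a noisy linear relationship

df = pd.DataFrame({"x": x, "y": y})           # pandas: structured, labeled data
model = LinearRegression().fit(df[["x"]], df["y"])   # scikit-learn: fit a model

plt.scatter(df["x"], df["y"], s=10)           # Matplotlib: visualize the data
plt.plot(df["x"], model.predict(df[["x"]]), color="red")
plt.title(f"fitted slope = {model.coef_[0]:.2f} (true slope = 3)")
plt.show()

Run inside an IPython notebook, the plot renders inline alongside the code, which is exactly the kind of exploratory loop this stack is designed for.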

The Mysteries of Big Data and the Orient … DB

Julian Ursell
30 Jun 2014
4 min read
Mapping the world of big data must be a lot like demystifying the antiquated concept of the Orient: trying to decipher a mass of unknowns. With the ever-multiplying expanse of data, and the natural desire of humans to understand it as soon as possible and in real time, technology is continually evolving to let us make sense of data, make connections between it, turn it into actionable insight, and act upon it physically in the real world. It's a huge enterprise, and you've got to imagine that within the masses of data collated years ago on legacy database systems, without the capacity for the technological insight and analysis we have now, there are relationships that remain undefined: the known unknowns, the unknown knowns, and the known knowns (that Rumsfeld guy was making sense, you see?). It's fascinating to think what we might learn from the data we have already collected.

There is a burning need these days to break down the mysteries of big data, and developers out there are continually thinking of ways we can interpret it, mapping data so that it is intuitive and understandable. The major way developers have reconceptualized data in order to make sense of it is as a network connected tightly together by relationships. The obvious examples are Facebook or LinkedIn, which map out vast networks of people connected by various shared properties, such as education, location, interest, or profession. One way of mapping highly connectable data is by structuring it in the form of a graph, a design that has emerged in recent years as databases have evolved. The main progenitor of this data structure is Neo4j, which is far and away the leader in the field of graph databases, mobilized by a huge number of enterprises working with big data. Neo4j has cornered the market, and it's not hard to see why: it offers a powerful solution with heavy commercial support for enterprise deployments.

In truth there aren't many alternatives out there, but alternatives exist. OrientDB is a hybrid graph-document database that offers the unique flexibility of modeling data in the form of either documents or graphs, while incorporating object-oriented programming as a way of encapsulating relationships. Again, it's a great example of developers imagining ways in which we can accommodate the myriad of different data types, and the relationships that connect them all together.

The real mystery of the Orient(DB), however, is the relatively low (visible) adoption of a database that offers both innovation and reputedly staggering levels of performance (claims are that it can store up to 150,000 records a second). The question isn't just why it hasn't managed to dent a market essentially owned by Neo4j, but why, on its own merits, haven't more developers opted for the database? The answer may in the end be vaguely related to commercial drivers: outside of Europe, it seems as if OrientDB has struggled to create the kind of traction that would push greater levels of adoption. Or perhaps it is related to the considerable development and tuning the project requires for use in production; related to that, maybe OrientDB still has a way to go in terms of enterprise-grade support. For sure, it's hard to say what the deciding factor is here. In many ways it's a simple reiteration of the level of difficulty facing startups and new technologies endeavoring to acquire adoption, and of the fact that the road to this goal is typically a long one.

Regardless, what both Neo4j and OrientDB are valuable for is adapting both familiar and unfamiliar programming concepts in order to reimagine the way we represent, model, and interpret connections in data, mapping the information of the world.