
Tech News - Big Data

12 Articles

Packt teams up with Humble Bundle again to bring readers big data content

Richard Gall
14 Aug 2018
2 min read
Packt has teamed up with Humble Bundle once again to bring readers an incredible range of content - while also supporting some incredible causes. This month, Packt has put together a selection of its best big data eBooks, videos and courses for Humble Bundle fans. Featuring DRM-free content worth $1,479 in total, the bundle can be picked up for a minimum of $15. With Humble Bundle and Packt supporting the Mental Health Foundation and Charity: Water, it's a good opportunity not only to pick up a stellar selection of content to learn and master the software behind modern big data, but also to help organizations doing really important work. You can find the offer on Humble Bundle here. The offer ends August 27, 2018.

Which big data eBooks and videos feature in this month's Humble Bundle?
You can rest assured that Packt has provided Humble Bundle with some of its most popular big data eBooks and videos. Covering everything from big data architecture to analytics and data science, you could make a big investment in your skill set for the price of lunch. Here's what you get...

Pay at least $1 and you'll get:
- Mastering Apache Spark 2.x, Second Edition
- Splunk Essentials, Second Edition
- MongoDB Cookbook, Second Edition
- Getting Started with Hadoop 2.x [Video]
- Learning ElasticSearch 5.0 [Video]
- Three months of Mapt Pro for $30

Or pay $8 and get all of the above as well as:
- Modern Big Data Processing with Hadoop
- Apache Hive Essentials, Second Edition
- Learning Elastic Stack 6.0
- Learning Hadoop 2
- Apache Spark with Scala [Video]
- Working with Big Data in Python [Video]
- Statistics for Data Science
- Python Data Analysis, Second Edition
- Learning R for Data Visualization [Video]

Or pay $15 and get everything above as well as:
- Big Data Analytics with Hadoop 3
- Mastering MongoDB 3.x
- Artificial Intelligence for Big Data
- Big Data Architect's Handbook
- Hadoop Real-World Solutions Cookbook, Second Edition
- Build scalable applications with Apache Kafka [Video]
- Learning Apache Cassandra [Video]
- Data Science Algorithms in a Week
- Python Data Science Essentials, Second Edition
- Mastering Tableau 10


Apache Kafka 2.0.0 has just been released

Richard Gall
30 Jul 2018
3 min read
Apache Kafka, the open source distributed data streaming software, has just hit version 2.0.0. With Kafka becoming a vital component in the (big) data architecture of many organizations, this new major stable release represents an important step in consolidating its importance for data architects and engineers.

Quick recap: what is Apache Kafka?
If you're not sure what Kafka is, let's take a moment to revisit what it does before getting into the details of the 2.0.0 release. Essentially, Kafka is a tool that allows you to stream, store and publish data. It's a bit like a message queue system. It's used either to move data between different systems and applications (i.e. to build data pipelines) or to develop applications that react in specific ways to streams of data. Kafka is an important tool because it can process data in real time. Key to this is the fact that it is distributed - things are scaled horizontally, across machines. It's not centralized. As the project website explains, Kafka is "run as a cluster on one or more servers that can span multiple datacenters."

What's new in Apache Kafka 2.0.0?
There's a huge range of changes and improvements that have gone live with Kafka 2.0.0. All of these are an attempt to give users more security, stability and reliability in their data architecture. It's Kafka doubling down on what it has always tried to do well. Here are a few of the key changes:

Security improvements in Kafka 2.0.0
- Simplified access control management for large deployments thanks to support for prefixed ACLs. "Bulk access to topics, consumer groups or transactional ids with a prefix can now be granted using a single rule. Access control for topic creation has also been improved to enable access to be granted to create specific topics or topics with a prefix."
- Encryption is easier to manage - "We now support Java 9, leading, among other things, to significantly faster TLS and CRC32C implementations. Over-the-wire encryption will be faster now, which will keep Kafka fast and compute costs low when encryption is enabled."
- Easier security configuration - SSL truststores can now be updated without a broker restart, and security for broker listeners in Zookeeper can be configured before starting brokers too.

Reliability improvements in Kafka 2.0.0
- Throttling notifications make it easier to distinguish between network errors and when quotas are maxed out.
- Improvements to the resiliency of brokers "by reducing the memory footprint of message down-conversions."
- Unit testing Kafka Streams will now be easier thanks to the kafka-streams-test-utils artifact.

You can read the details about the release here.
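If you have never touched Kafka, the publish/subscribe model described above is easier to grasp with a small example. The following is a minimal, illustrative sketch using the third-party kafka-python client; the broker address (localhost:9092) and the "page-views" topic are assumptions for the sake of the demo, not something from the release notes.

```python
# Minimal sketch assuming `pip install kafka-python` and a broker on localhost:9092.
from kafka import KafkaProducer, KafkaConsumer

# Publish a couple of events to a hypothetical "page-views" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for payload in [b'{"user": 1, "page": "/home"}', b'{"user": 2, "page": "/pricing"}']:
    producer.send("page-views", value=payload)
producer.flush()  # block until the broker has acknowledged the messages

# Read the stream back; a separate application could react to each event in real time.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating if no new messages arrive for 5 seconds
)
for message in consumer:
    print(message.topic, message.offset, message.value)
```

The same topic can feed many independent consumers, which is what makes Kafka useful both as a pipeline between systems and as a backbone for stream-reactive applications.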


AWS Greengrass brings machine learning to the edge

Richard Gall
09 Apr 2018
3 min read
AWS already has solutions for machine learning, edge computing, and IoT. But a recent update to AWS Greengrass has combined all of these facets so you can deploy machine learning models to the edge of networks. That's an important step forward in the IoT space for AWS. With Microsoft also recently announcing a $5 billion investment in IoT projects over the next 4 years, the AWS team is making sure it sets the pace in the industry by extending the capability of AWS Greengrass. Jeff Barr, AWS evangelist, explained the idea in a post on the AWS blog: "...You can now perform Machine Learning inference at the edge using AWS Greengrass. This allows you to use the power of the AWS cloud (including fast, powerful instances equipped with GPUs) to build, train, and test your ML models before deploying them to small, low-powered, intermittently-connected IoT devices running in those factories, vehicles, mines, fields..."

Industrial applications of machine learning inference
Machine learning inference is bringing lots of advantages to industry and agriculture. For example:
- In farming, edge-enabled machine learning systems will be able to monitor crops using image recognition - in turn, this will enable corrective action to be taken, allowing farmers to optimize yields.
- In manufacturing, machine learning inference at the edge should improve operational efficiency by making it easier to spot faults before they occur. For example, by monitoring vibrations or noise levels, Barr explains, you'll be able to identify faulty or failing machines before they actually break.
Running this on AWS Greengrass offers a number of advantages over running machine learning models and processing data locally - it means you can run complex models without draining your computing resources. Read more in detail in the AWS Greengrass Developer Guide.

AWS Greengrass should simplify machine learning inference
One of the fundamental benefits of using AWS Greengrass should be that it simplifies machine learning inference at every single stage of the typical machine learning workflow. From building and deploying machine learning models to developing inference applications that can be launched locally within an IoT network, it should, in theory, make the advantages of machine learning inference more accessible to more people. It will be interesting to see how this new feature is applied by IoT engineers over the next year or so. But it will also be interesting to see if this has any impact on the wider battle for the future of Industrial IoT.

Further reading:
- What is edge computing?
- AWS IoT Analytics: The easiest way to run analytics on IoT data, Amazon says
- What you need to know about IoT product development
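To give a rough sense of the shape of such a deployment, here is a heavily hedged Python sketch of a Lambda function running on a Greengrass core. The greengrasssdk "iot-data" client and its publish() call come from the Greengrass Core SDK for Python; load_local_model(), read_sensor(), the model path and the MQTT topic are hypothetical placeholders, not AWS sample code.

```python
# Illustrative sketch only: a Lambda deployed to a Greengrass core doing local inference.
import json
import greengrasssdk

def load_local_model(path):
    # Hypothetical stand-in for loading a framework-specific model deployed to the core.
    class _Model:
        def predict(self, reading):
            return "failing" if reading > 0.7 else "healthy"   # placeholder threshold "model"
    return _Model()

def read_sensor():
    # Hypothetical stand-in for reading a local device, e.g. a vibration level.
    return 0.9

iot_client = greengrasssdk.client("iot-data")
model = load_local_model("/greengrass-machine-learning/model")   # hypothetical path

def function_handler(event, context):
    # Inference happens locally, at the edge; only the small prediction payload
    # is published upstream, so intermittent connectivity matters less.
    prediction = model.predict(read_sensor())
    iot_client.publish(
        topic="factory/machine-42/prediction",
        payload=json.dumps({"prediction": prediction}),
    )
    return
```

The point of the sketch is the division of labour Barr describes: train in the cloud, then run the (possibly compressed) model next to the device that produces the data.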

Handpicked for your weekend Reading - 1st Dec 2017

Aarthi Kumaraswamy
01 Dec 2017
1 min read
Expert in Focus: Sebastian Raschka, on how Machine Learning has become more accessible

3 Things that happened this week in Data Science News
- Data science announcements at Amazon re:invent 2017
- IOTA, the cryptocurrency that uses Tangle instead of blockchain, announces Data Marketplace for Internet of Things
- Cloudera Altus Analytic DB: Modernizing the cloud-based data warehouses

Get hands-on with these Tutorials
- Building a classification system with logistic regression in OpenCV
- How to build a Scatterplot in IBM SPSS

Do you agree with these Insights & Opinions?
- Highest Paying Data Science Jobs in 2017
- 5 Ways Artificial Intelligence is Transforming the Gaming Industry
- 10 Algorithms every Machine Learning Engineer should know


Google joins the social coding movement with CoLaboratory

Savia Lobo
15 Nov 2017
3 min read
Google has made it quite easy for people to collaborate on their documents, spreadsheets, and so on through Google Drive. What next? If you are one of those data science nerds who love coding, this roll-out from Google could be an amazing experimental ground for you. Google has released its coLaboratory project, a new tool and a boon for data science and analysis. It is designed to make collaborating on data easier, similar to a Google document. This means it is capable of running code and providing simultaneous output within the document itself.

Collaboration is what sets coLaboratory apart. It allows improved collaboration among people with distinct skill sets - one may be great at coding, while another might know the front-end or GUI aspects of the project better. Just as you store and share a Google document or spreadsheet, you can store and share code with coLaboratory notebooks in Google Drive. All you have to do is click the 'Share' option at the top right of any coLaboratory notebook. You can also refer to the Google Drive file sharing instructions. This improves ad-hoc workflows without the need to mail documents back and forth.

CoLaboratory includes a Jupyter notebook environment that does not require any setup. One does not need to download, install, or run anything on their computer; all they need is a browser to use and share Jupyter notebooks. At present, coLaboratory works with Python 2.7 on the desktop version of Chrome only. The reason is that coLab with Python 2.7 has been an internal tool at Google for many years. Making it available on other browsers, with added support for other Jupyter kernels such as R or Scala, is on the cards.

CoLaboratory's GitHub repository contains two related tools that you can use to bring the tool to the browser: the coLaboratory Chrome App, and coLaboratory with Classic Jupyter Kernels. Both can be used for creating and storing notebooks within Google Drive, allowing collaborative editing within the notebooks. The only difference is that the Chrome App executes all the code within the browser using the PNaCl sandbox, whereas coLaboratory classic executes code using local Jupyter kernels (the IPython kernel) that have complete access to the host system and files.

Setting up a collaborative environment for data analysis can be a hurdle at times, as requirements vary among different machines and operating systems, and installation errors can be cryptic. With a single click, however, the coLaboratory Chrome App installs coLaboratory, IPython and a large set of popular scientific Python libraries. Also, because of the Portable Native Client (PNaCl), coLaboratory is secure and runs at local speeds, allowing new users to start exploring IPython faster.

Here's what coLaboratory brings for the code-lovers:
- No additional installation required - the browser does it all
- The ability to code within a document
- Storing and sharing notebooks on Google Drive
- Real-time collaboration, with no need to mail documents to and fro
You can find a detailed explanation of the tool on GitHub.
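For a sense of what a shared notebook cell might look like, here is a small, generic Python snippet of the kind you could run and share in such an environment. It only assumes the NumPy and Matplotlib libraries that ship with most scientific Python installs, and is not taken from Google's documentation.

```python
# A tiny, shareable analysis cell: simulate some noisy data and plot it inline.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 200)
y = np.sin(x) + np.random.normal(scale=0.1, size=x.shape)  # noisy signal

plt.plot(x, y, label="noisy sin(x)")
plt.legend()
plt.title("Every collaborator sees this output next to the code")
plt.show()
```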


Spark + H2O = Sparkling water for your machine learning needs

Aarthi Kumaraswamy
15 Nov 2017
3 min read
[box type="note" align="" class="" width=""]The following is an excerpt from the book Mastering Machine Learning with Spark, Chapter 1, Introduction to Large-Scale Machine Learning and Spark, written by Alex Tellez, Max Pumperla, and Michal Malohlava. This article introduces Sparkling Water - H2O's integration of their platform within the Spark project, which combines the machine learning capabilities of H2O with all the functionality of Spark.[/box]
H2O is an open source machine learning platform that plays extremely well with Spark; in fact, it was one of the first third-party packages deemed "Certified on Spark". Sparkling Water (H2O + Spark) is H2O's integration of their platform within the Spark project, which combines the machine learning capabilities of H2O with all the functionality of Spark. This means that users can run H2O algorithms on Spark RDDs/DataFrames for both exploration and deployment purposes. This is made possible because H2O and Spark share the same JVM, which allows for seamless transitions between the two platforms. H2O stores data in the H2O frame, which is a columnar-compressed representation of your dataset that can be created from a Spark RDD and/or DataFrame. Throughout much of this book, we will be referencing algorithms from Spark's MLlib library and H2O's platform, showing how to use both libraries to get the best results possible for a given task. The following is a summary of the features Sparkling Water comes equipped with:
- Use of H2O algorithms within a Spark workflow
- Transformations between Spark and H2O data structures
- Use of Spark RDDs and/or DataFrames as inputs to H2O algorithms
- Use of H2O frames as inputs into MLlib algorithms (this will come in handy when we do feature engineering later)
- Transparent execution of Sparkling Water applications on top of Spark (for example, we can run a Sparkling Water application within a Spark stream)
- The H2O user interface to explore Spark data

Design of Sparkling Water
Sparkling Water is designed to be executed as a regular Spark application. Consequently, it is launched inside a Spark executor created after submitting the application. At this point, H2O starts services, including a distributed key-value (K/V) store and memory manager, and orchestrates them into a cloud. The topology of the created cloud follows the topology of the underlying Spark cluster. As stated previously, Sparkling Water enables transformation between different types of RDDs/DataFrames and H2O's frame, and vice versa. When converting from a hex frame (an H2O frame) to an RDD, a wrapper is created around the hex frame to provide an RDD-like API. In this case, data is not duplicated but served directly from the underlying hex frame. Converting from an RDD/DataFrame to an H2O frame requires data duplication because it transforms data from Spark into H2O-specific storage. However, data stored in an H2O frame is heavily compressed and does not need to be preserved as an RDD anymore. If you enjoyed this excerpt, be sure to check out the book it appears in.
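The book's examples are written in Scala, but the same Spark-to-H2O handover can be sketched from Python via the PySparkling package. The snippet below is a sketch under assumptions: it presumes an existing SparkSession named spark, that the h2o_pysparkling package matching your Spark version is installed, and made-up column names; exact import paths and signatures vary across Sparkling Water releases.

```python
# Minimal PySparkling sketch: publish a Spark DataFrame as an H2O frame.
from pysparkling import H2OContext

hc = H2OContext.getOrCreate(spark)   # starts the H2O cloud inside the Spark executors

# A tiny Spark DataFrame standing in for real data (columns are illustrative).
spark_df = spark.createDataFrame(
    [(5.1, 3.5, "setosa"), (6.2, 2.9, "versicolor")],
    ["sepal_length", "sepal_width", "species"],
)

# Convert the Spark DataFrame into H2O's columnar-compressed frame; this copies the data
# into H2O-specific storage, as described above, so H2O algorithms can train on it directly.
h2o_frame = hc.asH2OFrame(spark_df)
h2o_frame.describe()
```

Going the other way (wrapping an H2O frame so Spark sees it as a DataFrame) avoids the copy, which is why the excerpt stresses that only the RDD/DataFrame-to-H2O direction duplicates data.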

Getting started with Storm Components for Real Time Analytics

Amarabha Banerjee
13 Nov 2017
8 min read
[box type="note" align="" class="" width=""]In this article by Shilpi Saxena and Saurabh Gupta from their book Practical Real-time data Processing and Analytics, we shall explore Storm's architecture with its components and configure it to run in a cluster.[/box]
Initially, real-time processing was implemented by pushing messages into a queue and then reading the messages from it using Python or any other language to process them one by one. The primary challenges with this approach were:
- In case of failure while processing any message, it has to be put back into the queue for reprocessing
- Keeping the queues and the worker (processing unit) up and running all the time
Below are the two main reasons that make Storm a highly reliable real-time engine:
- Abstraction: Storm is a distributed abstraction in the form of streams. A stream can be produced and processed in parallel. A spout can produce a new stream, and a bolt is a small unit of processing on a stream. The topology is the top-level abstraction. The advantage of abstraction here is that nobody needs to worry about what is going on internally, like serialization/deserialization, sending/receiving messages between different processes, and so on. The user can stay focused on writing the business logic.
- A guaranteed message processing algorithm: Nathan Marz developed an algorithm based on random numbers and XORs that requires only about 20 bytes to track each spout tuple, regardless of how much processing is triggered downstream.

Storm Architecture and Storm components
The nimbus node acts as the master node in a Storm cluster. It is responsible for analyzing the topology and distributing tasks to different supervisors as per availability. It also monitors failures; if one of the supervisors dies, it redistributes the tasks among the available supervisors. The nimbus node uses Zookeeper to keep track of tasks and maintain state. If the nimbus node fails, it can be restarted; it then reads the state from Zookeeper and starts from the point where it failed earlier.
Supervisors act as slave nodes in the Storm cluster. One or more workers, that is, JVM processes, can run in each supervisor node. A supervisor coordinates with its workers to complete the tasks assigned by the nimbus node. If a worker process fails, the supervisor finds available workers to complete the tasks.
A worker process is a JVM running in a supervisor node. It has executors; there can be one or more executors in a worker process. A worker coordinates with its executors to finish the task.
An executor is a single-threaded process spawned by a worker. Each executor is responsible for running one or more tasks.
A task is a single unit of work. It performs the actual processing on data. It can be either a spout or a bolt.
Apart from the above processes, there are two other important parts of a Storm cluster: logging and the Storm UI. The logviewer service is used to debug logs for workers at supervisors from the Storm UI.
The following are the primary characteristics of Storm that make it special and ideal for real-time processing:
- Fast
- Reliable
- Fault-tolerant
- Scalable
- Programming language agnostic

Storm Components
- Tuple: The basic data structure of Storm. It can hold multiple values, and the data type of each value can be different.
- Topology: As mentioned earlier, the topology is the highest level of abstraction. It contains the flow of processing, including spouts and bolts. It is a kind of graph computation.
- Stream: The stream is the core abstraction of Storm.
It is a sequence of unbounded tuples. A stream can be processed by different types of bolts, which results in a new stream.
- Spout: A spout is a source of a stream. It reads messages from sources like Kafka, RabbitMQ, and so on as tuples and emits them in a stream. There are two types of spout:
  - Reliable: The spout keeps track of each tuple and replays the tuple in case of any failure.
  - Unreliable: The spout does not care about the tuple once it is emitted as a stream to another bolt or spout.

Setting up and configuring Storm
Before setting up Storm, we need to set up Zookeeper, which is required by Storm.

Setting up Zookeeper
Below are instructions on how to install, configure and run Zookeeper in standalone and cluster mode.

Installing: Download Zookeeper from http://www-eu.apache.org/dist/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz. After the download, extract zookeeper-3.4.6.tar.gz as below:
tar -xvf zookeeper-3.4.6.tar.gz
The following files and folders will be extracted.

Configuring: There are two types of deployment with Zookeeper: standalone and cluster. There is no big difference in configuration, just a few extra parameters for cluster mode.

Standalone: As shown in the previous figure, go to the conf folder and change the zoo.cfg file as follows:
tickTime=2000                # Length of a single tick in milliseconds; used to regulate heartbeats and timeouts
initLimit=5                  # Amount of time to allow followers to connect and sync with the leader
syncLimit=2                  # Amount of time to allow followers to sync with Zookeeper
dataDir=/tmp/zookeeper/tmp   # Directory where Zookeeper keeps transaction logs
clientPort=2182              # Listening port for clients to connect
maxClientCnxns=30            # Maximum number of clients allowed to connect to the Zookeeper node

Cluster: In addition to the above configuration, add the following configuration to the cluster as well:
server.1=zkp-1:2888:3888
server.2=zkp-2:2888:3888
server.3=zkp-3:2888:3888
In server.x=[hostname]:nnnn:mmmm, x is the ID assigned to each Zookeeper node. In the dataDir configured above, create a file "myid" and put the corresponding ID of the Zookeeper node in it; it should be unique across the cluster, and the same ID is used as x here. nnnn is the port used by followers to connect with the leader node and mmmm is the port used for leader election.

Running: Use the following command to run Zookeeper from the Zookeeper home dir:
/bin/zkServer.sh start
The console will return after the message below and the process will run in the background:
Starting zookeeper ... STARTED
The following command can be used to check the status of the Zookeeper process:
/bin/zkServer.sh status
The output in standalone mode would be:
Mode: standalone
The output in cluster mode would be:
Mode: follower   # in case of a follower node
Mode: leader     # in case of the leader node

Setting up Apache Storm
Below are instructions on how to install, configure and run Storm with nimbus and supervisors.

Installing: Download Storm from http://www.apache.org/dyn/closer.lua/storm/apache-storm-1.0.3/apache-storm-1.0.3.tar.gz.
After the download, extract apache-storm-1.0.3.tar.gz as follows:
tar -xvf apache-storm-1.0.3.tar.gz
Below are the files and folders that will be extracted.

Configuring: As shown in the previous figure, go to the conf folder and add/edit properties in storm.yaml.
Set the Zookeeper hostnames in the Storm configuration:
storm.zookeeper.servers:
  - "zkp-1"
  - "zkp-2"
  - "zkp-3"
Set the Zookeeper port:
storm.zookeeper.port: 2182
Set the Nimbus node hostname so that the Storm supervisors can communicate with it:
nimbus.host: "nimbus"
Set the Storm local data directory to keep small pieces of information like conf, jars, and so on:
storm.local.dir: "/usr/local/storm/tmp"
Set the number of workers that will run on the current supervisor node. It is best practice to use the same number of workers as the number of cores in the machine:
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
  - 6704
  - 6705
Perform memory allocation for the worker, supervisor, and nimbus:
worker.childopts: "-Xmx1024m"
nimbus.childopts: "-XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70"
supervisor.childopts: "-Xmx1024m"
Topology-related configuration: the first setting is the maximum amount of time (in seconds) for a tuple's tree to be acknowledged (fully processed) before it is considered failed; the second sets debug logging to false, so Storm will generate only info logs:
topology.message.timeout.secs: 60
topology.debug: false

Running: There are four services needed to start a complete Storm cluster:
- Nimbus: First of all, we need to start the Nimbus service in Storm. The following is the command to start it:
/bin/storm nimbus
- Supervisor: Next, we need to start supervisor nodes to connect with the nimbus node. The following is the command:
/bin/storm supervisor
- UI: To start the Storm UI, execute the following command:
/bin/storm ui
You can access the UI at http://nimbus-host:8080, as shown in the following figure.
- Logviewer: The logviewer service helps you see the worker logs in the Storm UI. Execute the following command to start it:
/bin/storm logviewer

Summary
We started with the history of Storm, where we discussed how Nathan Marz got the idea for Storm and what type of challenges he faced while releasing Storm as open source software and then moving it to Apache. We discussed the architecture of Storm and its components: nimbus, supervisors, workers, executors, and tasks are part of Storm's architecture, and its components are the tuple, stream, topology, spout, and bolt. We discussed how to set up Storm and configure it to run in a cluster, with Zookeeper set up first, as Storm requires it. The above was an excerpt from the book Practical Real-time data Processing and Analytics.
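The "guaranteed message processing algorithm" mentioned earlier relies on XOR-ing tuple IDs: when every ID that was anchored has also been acked, the running XOR value returns to zero. The following is a small, purely illustrative Python sketch of that idea; it is a conceptual toy, not Storm's actual acker implementation.

```python
import random

class AckerEntry:
    """Toy illustration of Storm-style XOR tracking for one spout tuple."""
    def __init__(self):
        self.ack_val = 0   # roughly the per-spout-tuple state Storm keeps (a 64-bit value)

    def anchor(self, tuple_id):
        # A new downstream tuple was emitted: XOR its random 64-bit ID in.
        self.ack_val ^= tuple_id

    def ack(self, tuple_id):
        # The tuple finished processing: XOR the same ID out again.
        self.ack_val ^= tuple_id

    def fully_processed(self):
        # Zero means every anchored ID has also been acked.
        return self.ack_val == 0

entry = AckerEntry()
ids = [random.getrandbits(64) for _ in range(3)]   # three downstream tuples
for tid in ids:
    entry.anchor(tid)
for tid in ids:
    entry.ack(tid)
print(entry.fully_processed())   # True: the tuple tree is complete
```

Because XOR is order-independent, anchors and acks can arrive interleaved and out of order, which is what lets the tracking state stay tiny regardless of how much downstream processing a tuple triggers.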


How Near Real Time (NRT) Applications work

Amarabha Banerjee
10 Nov 2017
6 min read
[box type="note" align="" class="" width=""]In this article by Shilpi Saxena and Saurabh Gupta from their book Practical Real-time data Processing and Analytics, we shall explore what a near real time architecture looks like and how an NRT app works.[/box]
It's very important to understand the key aspects where traditional monolithic application systems fall short of serving the need of the hour:
- Backend DB: Single-point, monolithic data access.
- Ingestion flow: The pipelines are complex and tend to induce latency into the end-to-end flow. Systems are failure prone, but the recovery approach is difficult and complex.
- Synchronization and state capture: It's very difficult to capture and maintain the state of facts and transactions in the system. Diversely distributed systems and real-time system failures further complicate the design and maintenance of such systems.
The answer to the above issues is an architecture that supports streaming and thus provides its end users access to actionable insights in real time over ever-flowing in-streams of real-time fact data:
- Local state and consistency of the system for large-scale, high-velocity systems
- Data doesn't arrive at intervals; it keeps flowing in, and it's streaming in all the time
- No single state of truth in the form of a backend database; instead, the applications subscribe or tap into a stream of fact data
Before we delve further, it's worthwhile to understand the notion of time. Looking at this figure, it's easy to correlate the SLAs with each type of implementation (batch, near real-time, and real-time) and the kinds of use cases each implementation caters to. For instance, batch implementations have SLAs ranging from a couple of hours to days, and such solutions are predominantly deployed for canned/pre-generated reports and trends. The near real-time solutions have an SLA on the order of a few seconds to hours and cater to situations requiring ad-hoc queries, mid-resolution aggregators, and so on. The real-time applications are the most mission-critical in terms of SLA and resolution; these are cases where every event counts and the results have to be returned within an order of milliseconds to seconds.

Near real time (NRT) Architecture
In its essence, NRT architecture consists of four main components/layers, as depicted in the following figure:
- The message transport pipeline
- The stream processing component
- The low-latency data store
- Visualization and analytical tools
The first step is the collection of data from the source and providing it to the "data pipeline", which is actually a logical pipeline that collects the continuous events or streaming data from various producers and provides it to the consumer stream processing applications. These applications transform, collate, correlate, aggregate, and perform a variety of other operations on this live streaming data and then finally store the results in the low-latency data store. Then, a variety of analytical, business intelligence, and visualization tools and dashboards read this data from the data store and present it to the business user.

Data collection
This is the beginning of the journey of all data processing. Be it batch or real time, the foremost and most forthright challenge is to get the data from its source to the systems for our processing. We can look at the processing unit as a black box, with data sources and consumers acting as publishers and subscribers.
It's captured in the following diagram. The key aspects that come under the criteria for data collection tools, in the general context of big data and real-time specifically, are as follows:
- Performance and low latency
- Scalability
- Ability to handle structured and unstructured data
Apart from this, the data collection tool should be able to cater to data from a variety of sources, such as:
- Data from traditional transactional systems: We can duplicate the ETL process of these traditional systems and tap the data from the source, or tap the data from these ETL systems. A third and better approach is to go with a virtual data lake architecture for data replication.
- Structured data from IoT/sensors/devices, or CDRs: This is data that comes at a very high velocity and in a fixed format - the data can be from a variety of sensors and telecom devices.
- Unstructured data from media files, text data, social media, and so on: This is the most complex of all incoming data, where the complexity is due to the dimensions of volume, velocity, variety, and structure.

Stream processing
The stream processing component itself consists of three main sub-components:
- The broker, which collects and holds the events or data streams from the data collection agents
- The processing engine, which actually transforms, correlates, aggregates the data, and performs the other necessary operations
- The distributed cache, which serves as a mechanism for maintaining a common data set across all distributed components of the processing engine
The same aspects of the stream processing component are zoomed out and depicted in the diagram as follows. There are a few key attributes that should be catered to by the stream processing component:
- Distributed components, thus offering resilience to failures
- Scalability to cater to the growing needs of the application or a sudden surge of traffic
- Low latency to handle the overall SLAs expected from such applications
- Easy operationalization of use cases to be able to support evolving use cases
- Built for failures: the system should be able to recover from inevitable failures without any event loss, and should be able to reprocess from the point it failed
- Easy integration points with respect to off-heap/distributed caches or data stores
- A wide variety of operations, extensions, and functions to work with the business requirements of the use case

Analytical layer - serve it to the end user
The analytical layer is the most creative and interesting of all the components of an NRT application. So far, all we have talked about is backend processing, but this is the layer where we actually present the output/insights to the end user graphically and visually in the form of actionable items. A few of the challenges these visualization systems should be capable of handling are:
- The need for speed
- Understanding the data and presenting it in the right context
- Dealing with outliers
The figure depicts the flow of information from event producers to the collection agents, followed by the brokers and processing engine (transformation, aggregation, and so on) and then the long-term storage. From the storage unit, the visualization tools reap the insights and present them in the form of graphs, alerts, charts, Excel sheets, dashboards, or maps to the business owners, who can assimilate the information and take action based upon it. The above was an excerpt from the book Practical Real-time data Processing and Analytics.
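To make the four layers a little more concrete, here is a deliberately tiny, framework-free Python sketch of the same shape: producers push events onto a transport queue, a processing step aggregates them, and the results land in a low-latency store (a plain dict standing in for something like Redis or Cassandra). It is an illustration of the architecture, not production code, and all names in it are made up.

```python
import queue
import threading

transport = queue.Queue()   # stands in for the message transport pipeline (e.g. a broker)
low_latency_store = {}      # stands in for the low-latency data store

def producer():
    # Event producers: a continuous stream of (user, amount) facts.
    for i in range(100):
        transport.put({"user": f"user-{i % 5}", "amount": i})
    transport.put(None)      # sentinel to stop the demo

def stream_processor():
    # Processing layer: aggregate amounts per user as events flow in.
    while True:
        event = transport.get()
        if event is None:
            break
        user = event["user"]
        low_latency_store[user] = low_latency_store.get(user, 0) + event["amount"]

threading.Thread(target=producer).start()
proc = threading.Thread(target=stream_processor)
proc.start()
proc.join()

# The visualization/analytics layer would read from the store; here we just print it.
print(low_latency_store)
```

In a real NRT system each of these stand-ins would be a distributed component (Kafka or RabbitMQ for the transport, Storm/Spark/Flink for the processing, a low-latency store for the results), which is exactly the layering the excerpt describes.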


Soft skills every data scientist should teach their child

Aaron Lazar
09 Nov 2017
7 min read
Data Scientists work really hard to upskill their technical competencies. A rapidly changing technology landscape demands a continuous ramp up of skills like mastering a new programming language like R, Python, Java or something else, exploring new machine learning frameworks and libraries like TensorFlow or Keras, understanding cutting-edge algorithms like Deep Convolutional Networks and K-Means to name a few. Had they lived in Dr.Frankenstein's world, where scientists worked hard in their labs, cut-off from the rest of the world, this should have sufficed. But in the real world, data scientists use data and work with people to solve real-world problems for people. They need to learn something more, that forms a bridge between their ideas/hypotheses and the rest of the world. Something that’s more of an art than a skill these days. We’re talking about soft-skills for data scientists. Today we’ll enjoy a conversation between a father and son, as we learn some critical soft-skills for data scientists necessary to make it big in the data science world. [box type="shadow" align="" class="" width=""] One chilly evening, Tommy is sitting with his dad on their grassy backyard with the radio on, humming along to their favourite tunes. Tommy, gazing up at the sky for a while, asks his dad, “Dad, what are clouds made of?” Dad takes a sip of beer and replies, “Mostly servers, son. And tonnes of data.” Still gazing up, Tommy takes a deep breath, pondering about what his dad just said. Tommy: Tell me something, what’s the most important thing you’ve learned in your career as a Data Scientist? Dad smiles: I’m glad you asked, son. I’m going to share something important with you. Something I have learned over all these years crunching and munching data. I want you to keep this to yourself and remember it for as long as you can, okay? Tommy: Yes dad. Dad: Atta boy! Okay, the first thing you gotta do if you want to be successful, is you gotta be curious! Data is everywhere and it can tell you a lot. But if you’re not curious to explore data and tackle it from every angle, you will remain mediocre at best. Have an open mind - look at things through a kaleidoscope and challenge assumptions and presumptions. Innovation is the key to making the cut as a data scientist. Tommy nods his head approvingly. Dad, satisfied that Tommy is following along, continues. Dad: One of the most important skills a data scientist should possess is a great business acumen. Now, I know you must be wondering why one would need business acumen when all they’re doing is gathering a heap of data and making sense of it. Tommy looks straight-faced at his dad. Dad: Well, a data scientist needs to know the business like the back of their hand because unless they do, they won’t understand what the business’ strengths and weaknesses are and how data can contribute towards boosting its success. They need to understand where the business fits into the industry and what it needs to do to remain competitive. Dad’s last statement is rewarded by an energetic, affirmative nod from Tommy. Smiling, dad’s quite pleased with the response. Dad: Communication is next on the list. Without a clever tongue, a data scientist will find himself going nowhere in the tech world. Gone are the days when technical knowledge was all that was needed to sustain. A data scientist’s job is to help a business make critical, data-driven decisions. 
Of what use is it to the non-technical marketing or sales teams, if the data scientist can’t communicate his/her insights in a clear and effective way? A data scientist must also be a good listener to truly understand what the problem is to come up with the right solution. Tommy leans back in his chair, looking up at the sky again, thinking how he would communicate insights effectively. Dad continues: Very closely associated with communication, is the ability to present well, or as a data scientist would put it - tell tales that inspire action. Now a data scientist might have to put forward their findings before an entire board of directors, who will be extremely eager to know why they need to take a particular decision and how it will benefit the organization. Here, clear articulation, a knack for storytelling and strong convincing skills are all important for the data scientist to get the message across in the best way. Tommy quips: Like the way you convince mom to do the dishes every evening? Dad playfully punches Tommy: Hahaha, you little rascal! Tommy: Are there any more skills a data scientist needs to possess to excel at what they do? Dad: Indeed, there are! True data science is a research activity, where problems with unclear or unobvious solutions get solved. There are times when even the nature of the problem isn’t clear. A data scientist should be skilled at performing their own independent research - snooping around for information or data, gathering it and preparing it for further analysis. Many organisations look for people with strong research capabilities, before they recruit them. Tommy: What about you? Would you recruit someone without a research background? Dad: Well, personally no. But that doesn’t mean I would only hire someone if they were a PhD. Even an MSc would do, if they were able to justify their research project, and convince me that they’re capable of performing independent research. I wouldn’t hesitate to take them on board. Here’s where I want to share one of the most important skills I’ve learned in all my years. Any guesses on what it might be? Tommy: Hiring? Dad: Ummmmm… I’ll give this one to you ‘cos it’s pretty close. The actual answer is, of course, a much broader term - ‘management’. It encompasses everything from hiring the right candidates for your team to practically doing everything that a person handling a team does. Tommy: And what’s that? Dad: Well, as a senior data scientist, one would be expected to handle a team of lesser experienced data scientists, managing, mentoring and helping them achieve their goals. It’s a very important skill to hone, as you climb up the ladder. Some learn it through experience, others learn it by taking management courses. Either way, this skill is important for one to succeed in a senior role. And, that’s about all I have for now. I hope at least some of this benefits you, as you step into your first job tomorrow. Tommy smiles: Yeah dad, it’s great to have someone in the same line of work to look up to when I’m just starting out my career. I’m glad we had this conversation. Holding up an empty can, he says, “I’m out, toss me another beer, please.”[/box] Soft Skills for Data Scientists - A quick Recap In addition to keeping yourself technically relevant, to succeed as a data scientist you need to Be curious: Explore data from different angles, question the granted - assumptions & presumptions. Have strong business acumen: Know your customer, know your business, know your market. 
Communicate effectively: Speak the language of your audience, listen carefully to understand the problem you want to solve. Master the art of presenting well: Tell stories that inspire action, get your message across through a combination of data storytelling, negotiation and persuasion skills Be a problem solver: Do your independent research, get your hands dirty and dive deep for answers. Develop your management capabilities: Manage, mentor and help other data scientists reach their full potential.

Data Scientist: The sexiest role of the 21st century

Aarthi Kumaraswamy
08 Nov 2017
6 min read
"Information is the oil of the 21st century, and analytics is the combustion engine."
-Peter Sondergaard, Gartner Research

By 2018, it is estimated that companies will spend $114 billion on big data-related projects, an increase of roughly 300%, compared to 2013 (https://www.capgemini-consulting.com/resource-file-access/resource/pdf/big_data_pov_03-02-15.pdf). Much of this increase in expenditure is due to how much data is being created and how we are better able to store such data by leveraging distributed filesystems such as Hadoop. However, collecting the data is only half the battle; the other half involves data extraction, transformation, and loading into a computation system, which leverages the power of modern computers to apply various mathematical methods in order to learn more about data and patterns and extract useful information to make relevant decisions. The entire data workflow has been boosted in the last few years by not only increasing the computation power and providing easily accessible and scalable cloud services (for example, Amazon AWS, Microsoft Azure, and Heroku) but also by a number of tools and libraries that help to easily manage, control, and scale infrastructure and build applications. Such a growth in the computation power also helps to process larger amounts of data and to apply algorithms that were impossible to apply earlier. Finally, various computation-expensive statistical or machine learning algorithms have started to help extract nuggets of information from data.

Finding a uniform definition of data science is akin to tasting wine and comparing flavor profiles among friends—everyone has their own definition and no one description is more accurate than the other. At its core, however, data science is the art of asking intelligent questions about data and receiving intelligent answers that matter to key stakeholders. Unfortunately, the opposite also holds true—ask lousy questions of the data and get lousy answers! Therefore, careful formulation of the question is the key for extracting valuable insights from your data. For this reason, companies are now hiring data scientists to help formulate and ask these questions. At first, it's easy to paint a stereotypical picture of what a typical data scientist looks like: t-shirt, sweatpants, thick-rimmed glasses, and debugging a chunk of code in IntelliJ... you get the idea. Aesthetics aside, what are some of the traits of a data scientist? One of our favorite posters describing this role is shown here in the following diagram: Math, statistics, and general knowledge of computer science is a given, but one pitfall that we see among practitioners has to do with understanding the business problem, which goes back to asking intelligent questions of the data. It cannot be emphasized enough: asking more intelligent questions of the data is a function of the data scientist's understanding of the business problem and the limitations of the data; without this fundamental understanding, even the most intelligent algorithm would be unable to come to solid conclusions based on a wobbly foundation.

A day in the life of a data scientist
This will probably come as a shock to some of you—being a data scientist is more than reading academic papers, researching new tools, and model building until the wee hours of the morning, fueled on espresso; in fact, this is only a small percentage of the time that a data scientist gets to truly play (the espresso part however is 100% true for everyone)!
Most of the day, however, is spent in meetings, gaining a better understanding of the business problem(s), crunching the data to learn its limitations (take heart, this book will expose you to a ton of different feature engineering or feature extraction tasks), and learning how best to present the findings to non data-sciencey people. This is where the true sausage-making process takes place, and the best data scientists are the ones who relish this process because they are gaining more understanding of the requirements and benchmarks for success. In fact, we could literally write a whole new book describing this process from top to tail!

So, what (and who) is involved in asking questions about data? Sometimes, it is a process of saving data into a relational database and running SQL queries to find insights into the data: "for the millions of users that bought this particular product, what are the top 3 OTHER products also bought?" Other times, the question is more complex, such as, "Given the review of a movie, is this a positive or negative review?" This book is mainly focused on complex questions, like the latter. Answering these types of questions is where businesses really get the most impact from their big data projects and is also where we see a proliferation of emerging technologies that look to make this Q and A system easier, with more functionality. Some of the most popular, open source frameworks that look to help answer data questions include R, Python, Julia, and Octave, all of which perform reasonably well with small (X < 100 GB) datasets. At this point, it's worth stopping and pointing out a clear distinction between big versus small data. Our general rule of thumb in the office goes as follows: If you can open your dataset using Excel, you are working with small data.

Working with big data
What happens when the dataset in question is so vast that it cannot fit into the memory of a single computer and must be distributed across a number of nodes in a large computing cluster? Can't we just rewrite some R code, for example, and extend it to account for more than a single-node computation? If only things were that simple! There are many reasons why the scaling of algorithms to more machines is difficult. Imagine a simple example of a file containing a list of names:
B
D
X
A
D
A
We would like to compute the number of occurrences of individual words in the file. If the file fits into a single machine, you can easily compute the number of occurrences by using a combination of the Unix tools, sort and uniq:
bash> sort file | uniq -c
The output is as shown ahead:
2 A
1 B
2 D
1 X
However, if the file is huge and distributed over multiple machines, it is necessary to adopt a slightly different computation strategy. For example, compute the number of occurrences of individual words for every part of the file that fits into memory and merge the results together. Hence, even simple tasks, such as counting the occurrences of names, can become more complicated in a distributed environment. The above is an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava. If you would like to learn how to solve the above problem and other cool machine learning tasks a data scientist carries out, such as the following, check out the book.
- Use Spark streams to cluster tweets online
- Run the PageRank algorithm to compute user influence
- Perform complex manipulation of DataFrames using Spark
- Define Spark pipelines to compose individual data transformations
- Utilize generated models for off-line/on-line prediction
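To make the chunk-and-merge strategy described above concrete, here is a small, self-contained Python sketch (standard library only) that counts names per chunk and then merges the partial counts, mimicking what a distributed framework would do across machines; the chunk split is invented purely for illustration.

```python
from collections import Counter

# The same toy file contents as above, pretending it is split across three machines.
chunks = [
    ["B", "D"],   # partition on node 1
    ["X", "A"],   # partition on node 2
    ["D", "A"],   # partition on node 3
]

# "Map" step: each node counts the names in the part of the file it holds in memory.
partial_counts = [Counter(chunk) for chunk in chunks]

# "Reduce" step: merge the partial results into a global count.
total = Counter()
for partial in partial_counts:
    total.update(partial)

print(total)   # Counter({'A': 2, 'D': 2, 'B': 1, 'X': 1})
```

Frameworks such as Spark do exactly this kind of partition-local work followed by a merge, while also handling data placement, shuffles, and node failures for you.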


4 Clustering Algorithms every Data Scientist should know

Sugandha Lahoti
07 Nov 2017
6 min read
[box type="note" align="" class="" width=""]This is an excerpt from a book by John R. Hubbard titled Java Data Analysis. In this article, we see four popular clustering algorithms: hierarchical clustering, k-means clustering, k-medoids clustering, and the affinity propagation algorithm, along with their pseudo-code.[/box]
A clustering algorithm is one that identifies groups of data points according to their proximity to each other. These algorithms are similar to classification algorithms in that they also partition a dataset into subsets of similar points. But in classification, we already have data whose classes have been identified, such as sweet fruit. In clustering, we seek to discover the unknown groups themselves.

Hierarchical clustering
Of the several clustering algorithms that we will examine in this article, hierarchical clustering is probably the simplest. The trade-off is that it works well only with small datasets in Euclidean space. The general setup is that we have a dataset S of m points in Rn which we want to partition into a given number k of clusters C1, C2, ..., Ck, where within each cluster the points are relatively close together. Here is the algorithm:
1. Create a singleton cluster for each of the m data points.
2. Repeat m - k times:
   - Find the two clusters whose centroids are closest
   - Replace those two clusters with a new cluster that contains their points
The centroid of a cluster is the point whose coordinates are the averages of the corresponding coordinates of the cluster points. For example, the centroid of the cluster C = {(2, 4), (3, 5), (6, 6), (9, 1)} is the point (5, 4), because (2 + 3 + 6 + 9)/4 = 5 and (4 + 5 + 6 + 1)/4 = 4. This is illustrated in the figure below.

K-means clustering
A popular alternative to hierarchical clustering is the K-means algorithm. It is related to the K-Nearest Neighbor (KNN) classification algorithm. As with hierarchical clustering, the K-means clustering algorithm requires the number of clusters, k, as input. (This version is also called the K-Means++ algorithm.) Here is the algorithm:
1. Select k points from the dataset.
2. Create k clusters, each with one of the initial points as its centroid.
3. For each dataset point x that is not already a centroid:
   - Find the centroid y that is closest to x
   - Add x to that centroid's cluster
   - Re-compute the centroid for that cluster
It also requires k points, one for each cluster, to initialize the algorithm. These initial points can be selected at random, or by some a priori method. One approach is to run hierarchical clustering on a small sample taken from the given dataset and then pick the centroids of those resulting clusters.

K-medoids clustering
The k-medoids clustering algorithm is similar to the k-means algorithm, except that each cluster center, called its medoid, is one of the data points instead of being the mean of its points. The idea is to minimize the average distances from the medoids to points in their clusters. The Manhattan metric is usually used for these distances. Since those averages will be minimal if and only if the distances are, the algorithm is reduced to minimizing the sum of all distances from the points to their medoids. This sum is called the cost of the configuration. Here is the algorithm:
1. Select k points from the dataset to be medoids.
2. Assign each data point to its closest medoid. This defines the k clusters.
3. For each cluster Cj:
   - Compute the sum s = ∑j sj, where each sj = ∑ { d(x, yj) : x ∈ Cj }, and change the medoid yj to whatever point in the cluster Cj minimizes s
   - If the medoid yj was changed, re-assign each x to the cluster whose medoid is closest
4. Repeat step 3 until s is minimal.
This is illustrated by the simple example in Figure 8.16. It shows 10 data points in 2 clusters. The two medoids are shown as filled points. The initial configuration is:
C1 = {(1,1), (2,1), (3,2), (4,2), (2,3)}, with y1 = x1 = (1,1)
C2 = {(4,3), (5,3), (2,4), (4,4), (3,5)}, with y2 = x10 = (3,5)
The sums are:
s1 = d(x2,y1) + d(x3,y1) + d(x4,y1) + d(x5,y1) = 1 + 3 + 4 + 3 = 11
s2 = d(x6,y2) + d(x7,y2) + d(x8,y2) + d(x9,y2) = 3 + 4 + 2 + 2 = 11
s = s1 + s2 = 11 + 11 = 22
At the first part of step 3, the algorithm changes the medoid for C1 to y1 = x3 = (3,2). This causes the clusters to change, at the second part of step 3, to:
C1 = {(1,1), (2,1), (3,2), (4,2), (2,3), (4,3), (5,3)}, with y1 = x3 = (3,2)
C2 = {(2,4), (4,4), (3,5)}, with y2 = x10 = (3,5)
This makes the sums:
s1 = 3 + 2 + 1 + 2 + 2 + 3 = 13
s2 = 2 + 2 = 4
s = s1 + s2 = 13 + 4 = 17
The resulting configuration is shown in the second panel of the figure below. At step 3, the process repeats for cluster C2. The resulting configuration is shown in the third panel of the above figure. The computations are:
C1 = {(1,1), (2,1), (3,2), (4,2), (4,3), (5,3)}, with y1 = x3 = (3,2)
C2 = {(2,3), (2,4), (4,4), (3,5)}, with y2 = x8 = (2,4)
s = s1 + s2 = (3 + 2 + 1 + 2 + 3) + (1 + 2 + 2) = 11 + 5 = 16
The algorithm continues with two more changes, finally converging to the minimal configuration shown in the fifth panel of the above figure. This version of k-medoid clustering is also called partitioning around medoids (PAM).

Affinity propagation clustering
One disadvantage of each of the clustering algorithms previously presented (hierarchical, k-means, k-medoids) is the requirement that the number of clusters k be determined in advance. The affinity propagation clustering algorithm does not have that requirement. Developed in 2007 by Brendan J. Frey and Delbert Dueck at the University of Toronto, it has become one of the most widely-used clustering methods. Like k-medoid clustering, affinity propagation selects cluster center points, called exemplars, from the dataset to represent the clusters. This is done by message-passing between the data points. The algorithm works with three two-dimensional arrays:
sij = the similarity between xi and xj
rik = responsibility: a message from xi to xk on how well-suited xk is as an exemplar for xi
aik = availability: a message from xk to xi on how well-suited xk is as an exemplar for xi
Here is the complete algorithm:
1. Initialize the similarities: sij = -d(xi, xj)^2 for i ≠ j; sii = the average of those other sij values
2. Repeat until convergence:
   - Update the responsibilities: rik = sik - max { aij + sij : j ≠ k }
   - Update the availabilities: aik = min { 0, rkk + ∑j { max {0, rjk} : j ≠ i ∧ j ≠ k } } for i ≠ k; akk = ∑j { max {0, rjk} : j ≠ k }
A point xk will be an exemplar for a point xi if aik + rik = maxj { aij + rij }.
If you enjoyed this excerpt from the book Java Data Analysis by John R. Hubbard, check out the book to learn how to implement various machine learning algorithms, data visualization and more in Java.
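The book's implementations are in Java, but the k-means procedure described above is easy to sketch in a few lines of Python. The following is a plain, standard-library illustration of the usual Lloyd-style iteration (random initial centroids, assignment, centroid recomputation) on 2-D points; the data values are made up, and it is not the book's code.

```python
import random

def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points using squared Euclidean distance."""
    centroids = random.sample(points, k)            # step 1: pick k initial points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                            # assignment step
            idx = min(range(k),
                      key=lambda i: (p[0] - centroids[i][0]) ** 2 +
                                    (p[1] - centroids[i][1]) ** 2)
            clusters[idx].append(p)
        for i, cluster in enumerate(clusters):      # update step: recompute centroids
            if cluster:
                centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return centroids, clusters

data = [(2, 4), (3, 5), (6, 6), (9, 1), (1, 1), (2, 1), (8, 2)]
centroids, clusters = kmeans(data, k=2)
print(centroids)
print(clusters)
```

Swapping the centroid recomputation for "pick the in-cluster point that minimizes the total Manhattan distance" turns the same skeleton into the k-medoids (PAM) variant described above.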


Stream me up, Scotty!

Packt Editorial Staff
06 Oct 2017
10 min read
[box type="note" align="aligncenter" class="" width=""]The following is an excerpt from the book Scala and Spark for Big Data Analytics, Chapter 9, Stream me up, Scotty - Spark Streaming written by Md. Rezaul Karim and Sridhar Alla. It explores the big three stream processing paradigms that are in use today. [/box] In today's world of interconnected devices and services, it is hard to spend even a few hours a day without our smartphone to check Facebook, or hail an Uber ride, or tweet something about the burger we just bought, or check the latest news or sports updates on our favorite team. We depend on our phones and Internet, for a lot of things, whether it is to get work done, or just browse, or e-mail a friend. There is simply no way around this phenomenon, and the number and variety of applications and services will only grow over time. As a result, the smart devices are everywhere, and they generate a lot of data all the time. This phenomenon, also broadly referred to as the Internet of Things, has changed the dynamics of data processing forever. Whenever you use any of the services or apps on your iPhone, or Droid or Windows phone, in some shape or form, real-time data processing is at work. Since so much depends on the quality and value of the apps, there is a lot of emphasis on how the various startups and established companies are tackling the complex challenges of SLAs (Service Level Agreements), and usefulness and also the timeliness of the data. One of the paradigms being researched and adopted by organisations and service providers is the building of very scalable, near real-time or real-time processing frameworks on  cutting-edge platforms or infrastructure. Everything must be fast and also reactive to changes and failures. You won’t like it if your Facebook updated once every hour or if you received email only once a day; so, it is imperative that data flow, processing, and the usage are all as close to real time as possible. Many of the systems we are interested in monitoring or implementing, generate a lot of data as an indefinite continuous stream of events. As in any data processing system, we have the same fundamental challenges of data collection, storage, and data processing. However, the additional complexity is due to the real-time needs of the platform. In order to collect such indefinite streams of events and then subsequently process all such events to generate actionable insights, we need to use highly scalable specialized architectures to deal with tremendous rates of events. As such, many systems have been built over the decades starting from AMQ, RabbitMQ, Storm, Kafka, Spark, Flink, Gearpump, Apex, and so on. Modern systems built to deal with such large amounts of streaming data come with very flexible and scalable technologies that are not only very efficient but also help realize the business goals much better than before. Using such technologies, it is possible to consume data from a variety of data sources and then use it in a variety of use cases almost immediately or at a later time as needed. Let us talk about what happens when you book an Uber ride on your smartphone to go to the airport. With a few touches on the smartphone screen, you're able to select a point, choose the credit card, make the payment, and book the ride. Once you're done with your transaction, you then get to monitor the progress of your car real-time on a map on your phone. 
As the car is making its way toward you, you're able to monitor exactly where it is, and you can also decide to pick up coffee at the local Starbucks while you wait for the car to pick you up. You could also make informed decisions about the car and the subsequent trip to the airport by looking at the car's expected time of arrival. If it looks like the car is going to take quite a while to pick you up, and this poses a risk to the flight you are about to catch, you could cancel the ride and hop into a taxi that happens to be nearby. Alternatively, if the traffic situation is not going to let you reach the airport on time, thus posing a risk to the flight you are due to catch, you also get to make a decision about rescheduling or canceling your flight.

Now, in order to understand how real-time streaming architectures such as Uber's Apollo work to provide such invaluable information, we need to understand the basic tenets of streaming architectures. On the one hand, it is very important for a real-time streaming architecture to be able to consume extreme amounts of data at very high rates; on the other hand, it also has to ensure reasonable guarantees that the data being ingested is actually processed. A generic stream processing system consists of a producer putting events into a messaging system while a consumer reads from that messaging system.

Processing of real-time streaming data can be categorized into the following three essential paradigms:
At least once processing
At most once processing
Exactly once processing

Let's look at what these three stream processing paradigms mean for our business use cases. While exactly once processing of real-time events is the ultimate nirvana for us, it is very difficult to always achieve this goal in different scenarios. We have to compromise on the property of exactly once processing in cases where the benefit of such a guarantee is outweighed by the complexity of the implementation.

Stream Processing Paradigm 1: At least once processing

The at least once processing paradigm involves a mechanism to save the position of the last event received only after the event is actually processed and the results are persisted somewhere, so that, if there is a failure and the consumer restarts, the consumer will read the old events again and process them. However, since there is no guarantee that the received events were not processed at all or were only partially processed, this causes a potential duplication of events as they are fetched again. The result is that events get processed at least once.

At least once processing is ideally suitable for any application that involves updating an instantaneous ticker or gauge to show current values. Any cumulative sum, counter, or aggregation whose accuracy matters (sum, groupBy, and so on) does not fit this kind of processing, simply because duplicate events will cause incorrect results.

The sequence of operations for the consumer is as follows:
1. Save results
2. Save offsets

Consider what happens if there is a failure and the consumer restarts. Since the events have already been processed but the offsets have not been saved, the consumer will read again from the previously saved offsets, causing duplicates: event 0, for example, ends up being processed twice.
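To make the ordering concrete, here is a minimal sketch of an at-least-once consumer loop written in Scala against the Kafka consumer API. Kafka is just one of the messaging systems mentioned above, and the topic name, connection settings, and the saveResult helper are illustrative assumptions rather than code from the book; the point is only the order of the two steps.

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

object AtLeastOnceConsumer {
  // Hypothetical sink: persists a processed event somewhere durable (a database, a file, etc.)
  def saveResult(value: String): Unit = println(s"persisted: $value")

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "at-least-once-demo")
    props.put("enable.auto.commit", "false") // we commit offsets ourselves
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("events"))

    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      records.asScala.foreach(r => saveResult(r.value())) // 1. save results
      consumer.commitSync()                               // 2. save offsets
      // A crash between the two steps means the batch is re-read and
      // re-processed on restart, hence "at least once".
    }
  }
}
```

Flipping the two steps, committing offsets first and persisting results afterwards, would give the at most once behavior described next.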
Stream Processing Paradigm 2: At most once processing

The at most once processing paradigm involves a mechanism to save the position of the last event received before the event is actually processed and the results are persisted somewhere, so that, if there is a failure and the consumer restarts, the consumer will not try to read the old events again. However, since there is no guarantee that the received events were all processed, this causes a potential loss of events, as they are never fetched again. The result is that events are processed at most once, or not processed at all.

At most once processing is ideally suitable for any application that involves updating an instantaneous ticker or gauge to show current values, as well as any cumulative sum, counter, or other aggregation, provided accuracy is not mandatory or the application does not absolutely need every event. Any events lost will cause incorrect or missing results.

The sequence of operations for the consumer is as follows:
1. Save offsets
2. Save results

Consider what happens if there is a failure and the consumer restarts. Since the events have not been processed but the offsets have already been saved, the consumer will read from the saved offsets, causing a gap in the events consumed: event 0, for example, is never processed.

Stream Processing Paradigm 3: Exactly once processing

The exactly once processing paradigm is similar to the at least once paradigm: it involves a mechanism to save the position of the last event received only after the event has actually been processed and the results are persisted somewhere, so that, if there is a failure and the consumer restarts, the consumer will read the old events again and process them. However, since there is no guarantee that the received events were not processed at all or were only partially processed, this causes a potential duplication of events as they are fetched again. Unlike the at least once paradigm, though, the duplicate events are not processed but dropped, resulting in exactly once behavior.

The exactly once processing paradigm is suitable for any application that involves accurate counters or aggregations, or which, in general, needs every event processed only once and also definitely once (without loss).

The sequence of operations for the consumer is as follows:
1. Save results
2. Save offsets

Consider what happens if there is a failure and the consumer restarts. Since the events have already been processed but the offsets have not been saved, the consumer will read again from the previously saved offsets and fetch duplicates; however, the consumer detects and drops the duplicate, so event 0 is still processed only once.

How does the exactly once paradigm drop duplicates? There are two techniques that can help here:
Idempotent updates
Transactional updates

Idempotent updates involve saving results under some generated unique ID/key, so that, if a duplicate arrives, the generated unique ID/key will already be present in the results store (for instance, a database) and the consumer can drop the duplicate without updating the results. This is complicated, as it is not always possible or easy to generate unique keys, and it requires additional processing on the consumer end. Another point is that the database for results and offsets can be separate.
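As a rough illustration of the idempotent approach, the sketch below leans on a database primary key to absorb duplicates. The table layout, connection string, and PostgreSQL-style ON CONFLICT clause are assumptions made for the example; the excerpt does not prescribe a particular store.

```scala
import java.sql.DriverManager

object IdempotentSink {
  // Hypothetical results table: results(event_id TEXT PRIMARY KEY, payload TEXT).
  // The unique event_id is what lets us drop duplicates safely.
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://localhost:5432/streamdb", "user", "secret")
    val upsert = conn.prepareStatement(
      "INSERT INTO results (event_id, payload) VALUES (?, ?) ON CONFLICT (event_id) DO NOTHING")

    def saveIdempotently(eventId: String, payload: String): Unit = {
      upsert.setString(1, eventId)
      upsert.setString(2, payload)
      // A redelivered event hits the primary-key conflict and is silently skipped,
      // so re-processing a batch after a failure does not distort the stored results.
      upsert.executeUpdate()
    }

    saveIdempotently("evt-0", "first delivery")
    saveIdempotently("evt-0", "duplicate delivery") // dropped by ON CONFLICT

    upsert.close()
    conn.close()
  }
}
```

Because every write is keyed by the event ID, replaying events after a restart leaves the results unchanged, which is exactly the property the exactly once paradigm needs from its sink.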
Transactional updates save results in batches, with a transaction-begin and a transaction-commit phase, so that when the commit occurs we know that the events were processed successfully. When duplicate events are received, they can then be dropped without updating the results. This technique is even more complicated than idempotent updates, as we now need a transactional data store. Another point is that the database must be the same for both results and offsets.

You should look into the use case you're trying to build and see whether at least once processing or at most once processing can reasonably be used while still achieving an acceptable level of performance and accuracy.

If you enjoyed this excerpt, be sure to check out the book it appears in, Scala and Spark for Big Data Analytics. You will also like this exclusive interview on why Spark is ideal for stream processing with Romeo Kienzler, Chief Data Scientist in the IBM Watson IoT worldwide team and author of Mastering Apache Spark, 2nd Edition.