Tools and techniques
Let's take a look at different tools and techniques used in Hadoop and Spark for Big Data analytics.
While the Hadoop platform can be used both to store and to process data, Spark handles only the processing: it reads data into memory from a storage layer such as HDFS and computes on it there.
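As a minimal PySpark sketch of this split (the HDFS path and the `event_type` column are hypothetical), HDFS stores the data and Spark pulls it into memory to run the computation:

```python
from pyspark.sql import SparkSession

# Hypothetical illustration: HDFS stores the data; Spark reads it into
# memory and performs the processing.
spark = SparkSession.builder.appName("hdfs-to-spark-sketch").getOrCreate()

# Read a dataset that lives on HDFS (path is assumed for illustration).
events = spark.read.json("hdfs:///data/raw/events")

# Cache it so repeated computations run against in-memory data.
events.cache()

# A simple aggregation executed by Spark's in-memory engine
# (the "event_type" column is assumed for illustration).
events.groupBy("event_type").count().show()

spark.stop()
```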
The following is a tabular representation of the tools and techniques used in typical Big Data analytics projects:
| Phase | Tools used | Techniques used |
|---|---|---|
| Data collection | Apache Flume for real-time data collection and aggregation<br>Apache Sqoop for data import and export from relational data stores and NoSQL databases<br>Apache Kafka for the publish-subscribe messaging system<br>General-purpose tools such as FTP/Copy | Real-time data capture<br>Export<br>Import<br>Message publishing<br>Data APIs<br>Screen scraping |
| Data storage and formats | HDFS: Primary storage of Hadoop<br>HBase: NoSQL database<br>Parquet: Columnar format<br>Avro: Serialization system on Hadoop<br>Sequence File: Binary key-value pairs<br>RC File: First columnar format in Hadoop<br>ORC File: Optimized RC File<br>XML and JSON: Standard data interchange formats<br>Compression formats: Gzip, Snappy, LZO, Bzip2, Deflate, and others<br>Unstructured: Text, images, videos, and so on | Data storage<br>Data archival<br>Data compression<br>Data serialization<br>Schema evolution |
| Data transformation and enrichment | MapReduce: Hadoop's processing framework<br>Spark: Compute engine<br>Hive: Data warehouse and querying<br>Pig: Data flow language<br>Python: Functional programming<br>Crunch, Cascading, Scalding, and Cascalog: Special-purpose MapReduce tools | Data munging<br>Filtering<br>Joining<br>ETL<br>File format conversion<br>Anonymization<br>Re-identification |
| Data analytics | Hive: Data warehouse and querying<br>Pig: Data flow language<br>Tez: Alternative to MapReduce<br>Impala: Alternative to MapReduce<br>Drill: Alternative to MapReduce<br>Apache Storm: Real-time compute engine<br>Spark Core: Spark's core compute engine<br>Spark Streaming: Real-time compute engine<br>Spark SQL: For SQL analytics (see the sketch after this table)<br>Solr: Search platform<br>Apache Zeppelin: Web-based notebook<br>Jupyter Notebooks<br>Databricks cloud<br>Apache NiFi: Data flow<br>Spark-on-HBase connector<br>Programming languages: Java, Scala, and Python | Online Analytical Processing (OLAP)<br>Data mining<br>Data visualization<br>Complex event processing<br>Real-time stream processing<br>Full-text search<br>Interactive data analytics |
| Data science | Python: Functional programming<br>R: Statistical computing language<br>Mahout: Hadoop's machine learning library<br>MLlib: Spark's machine learning library<br>GraphX and GraphFrames: Spark's graph processing framework and its DataFrame-based equivalent for graphs | Predictive analytics<br>Sentiment analytics<br>Text and Natural Language Processing<br>Network analytics<br>Cluster analytics |
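As a concrete illustration of the Data analytics row, the following is a minimal Spark SQL sketch that runs an OLAP-style aggregation over data kept in one of the columnar formats listed above. The Parquet path, view name, and the `region` and `amount` columns are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-analytics-sketch").getOrCreate()

# Load a Parquet dataset produced during the storage/format stage
# (the path is assumed for illustration).
sales = spark.read.parquet("hdfs:///data/curated/sales")

# Expose the DataFrame as a temporary view so it can be queried with SQL.
sales.createOrReplaceTempView("sales")

# An OLAP-style aggregation expressed in Spark SQL
# (the "region" and "amount" columns are assumed for illustration).
spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
    ORDER BY total_amount DESC
""").show()

spark.stop()
```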