Big Data processing technologies provide ways to work with extensive sets of structured, semi-structured, and unstructured data so that values can be derived from Big Data.

There are several tools to process Big Data, but the most well-known are:

  • Apache Hadoop: a collection of tools that provides distributed storage and processing of big data
  • Apache Hive: a data warehouse for data query and analysis, built on top of Hadoop
  • Apache Spark: a distributed analytics framework for complex, real-time data analytics

I. Apache Hadoop

1. Hadoop and its Benefits

Hadoop is Java-based software that provides distributed storage and processing of large datasets across clusters of computers.

Each computer is a node (either a NameNode, the master server, or a DataNode, a worker that stores data), and a number of nodes together form a cluster.

Hadoop provides a reliable, scalable and cost-effective solution for storing data with no format requirements.

Benefits:

  • Better real-time data-driven decisions: Incorporates emerging data formats not traditionally used in data warehouses.
  • Improved data access and analysis: Provides real-time, self-service access to stakeholders.
  • Data offload and consolidation: Optimizes and streamlines costs by consolidating data, including cold data, across the organization.

2. Hadoop Distributed File System (HDFS)

Hadoop Distributed File System (HDFS) is a storage system for big data that runs on multiple commodity hardware machines connected through a network.

Main features:

  • Provides scalable and reliable big data storage by partitioning files over multiple nodes.
  • Splits large files across multiple computers, allowing parallel access to them.
  • Replicates file blocks on different nodes to prevent data loss.

Other benefits:

  • Fast recovery from hardware failures, because HDFS is built to detect faults and automatically recover
  • Access to streaming data, because HDFS supports high data throughput rates
  • Accommodation of large datasets, because HDFS can scale to hundreds of nodes, or computers, in a single cluster
  • Portability, because HDFS is portable across multiple hardware platforms and is compatible with many operating systems (it is a Java application)
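The splitting and replication described above can be sketched in a few lines. The following is a minimal plain-Python illustration, not real HDFS: the tiny block size, replication factor, node names, and round-robin placement are all simplifying assumptions (real HDFS defaults to 128 MB blocks, a replication factor of 3, and rack-aware placement).

```python
from itertools import cycle

BLOCK_SIZE = 4          # bytes per block (tiny, purely for illustration)
REPLICATION = 3         # copies kept of each block
NODES = ["node1", "node2", "node3", "node4"]   # hypothetical DataNodes

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a file's contents into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    node_cycle = cycle(range(len(nodes)))
    for idx, _ in enumerate(blocks):
        start = next(node_cycle)
        placement[idx] = [nodes[(start + r) % len(nodes)]
                          for r in range(replication)]
    return placement

data = b"hello big data world"
blocks = split_into_blocks(data)
placement = place_blocks(blocks)
# Each block now lives on 3 distinct nodes, so losing any single node
# still leaves at least 2 copies of every block available.
```

Because every block is held by multiple nodes, reads can proceed in parallel against different machines, and a failed node only reduces redundancy rather than losing data.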

II. Apache Hive

Hive is an open-source data warehouse software for reading, writing and managing large dataset files that are stored directly in HDFS or other data storage systems.

Attributes:

  • Queries have high latency => not suitable for applications that need fast response time.
  • Hive is read-based => not suitable for heavy data writing tasks.
  • Best suited to data warehousing tasks: ETL, reporting, and analysis
  • Offers easy access to data via SQL
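The kind of SQL access Hive provides can be illustrated with a small example. This sketch uses Python's built-in sqlite3 as a stand-in query engine, since a real Hive deployment needs a cluster; the table and column names are made up, and in Hive the same aggregation would run in HiveQL as a batch job over files in HDFS.

```python
import sqlite3

# In-memory database standing in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("north", 50.0), ("south", 75.0)],
)

# A typical reporting query: total sales per region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
# rows == [("north", 150.0), ("south", 75.0)]
```

Queries like this are why Hive fits reporting and analysis well: analysts express the work declaratively in SQL, and high per-query latency matters little for batch workloads.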

III. Apache Spark

Spark is a general-purpose data processing engine designed to extract and process large volumes of data for a wide range of applications, including:

  • Interactive Analytics
  • Streams Processing
  • Machine Learning
  • Data Integration
  • ETL

Key attributes:

  • Uses in-memory processing, which significantly increases the speed of computations
  • Provides interfaces for major programming languages such as Java, Scala, Python, R, and SQL
  • Can run using its standalone clustering technology
  • Can also run on top of other infrastructures, such as Hadoop
  • Can access data in a large variety of data sources, including HDFS and Hive
  • Processes streaming data quickly
  • Performs complex analytics in real time