Big Data processing technologies provide ways to work with large sets of structured, semi-structured, and unstructured data so that value can be derived from Big Data.
There are many tools for processing Big Data, but the best-known are:
- Apache Hadoop: a collection of tools that provides distributed storage and processing of Big Data.
- Apache Hive: a data warehouse for data query and analysis, built on top of Hadoop.
- Apache Spark: a distributed analytics framework for complex, real-time data analytics.
I. Apache Hadoop
1. Hadoop and its Benefits
Hadoop is Java-based software that provides distributed storage and processing of large datasets across clusters of computers.
Each computer in a cluster is a node: it can act as the NameNode (the master server that manages the file system) or as a DataNode (a worker that stores the data), and a group of nodes forms a cluster.
Hadoop provides a reliable, scalable and cost-effective solution for storing data with no format requirements.
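As a rough illustration of this master/worker layout, the sketch below shells out to Hadoop's command-line tools to print the cluster report the NameNode maintains, with one section per DataNode. It assumes a running Hadoop installation with the hdfs command on the PATH.

```python
import subprocess

# Ask the NameNode for a cluster report; assumes a running Hadoop
# installation with the `hdfs` command on the PATH.
report = subprocess.run(
    ["hdfs", "dfsadmin", "-report"],
    capture_output=True, text=True, check=True,
).stdout

# The report shows overall capacity plus one section per DataNode,
# reflecting the NameNode (master) / DataNode (worker) split.
print(report)
```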
Benefits:
- Better real-time data-driven decisions: Incorporates emerging data formats not traditionally used in data warehouses.
- Improved data access and analysis: Provides real-time, self-service access to stakeholders.
- Data offload and consolidation: Optimizes and streamlines costs by consolidating data, including cold data, across the organization.
2. Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS) is a storage system for big data that runs on clusters of commodity hardware connected through a network (a short usage sketch follows the lists below).
Main features:
- Provides scalable and reliable big data storage by partitioning files over multiple nodes.
- Splits large files across multiple computers, allowing parallel access to them.
- Replicates file blocks on different nodes to prevent data loss.
Other benefits:
- Fast recovery from hardware failures, because HDFS is built to detect faults and recover automatically.
- Access to streaming data, because HDFS supports high data throughput rates.
- Accommodation of large datasets, because HDFS can scale to hundreds of nodes in a single cluster.
- Portability, because HDFS is a Java application and is therefore portable across hardware platforms and compatible with many operating systems.
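The following is a minimal sketch of everyday HDFS usage from Python, shelling out to the standard hdfs dfs commands. It assumes a running cluster with the hdfs command on the PATH; the file paths and replication factor are illustrative only.

```python
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` subcommand and return its output."""
    result = subprocess.run(
        ["hdfs", "dfs", *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Copy a local file into HDFS; behind the scenes it is split into
# blocks and distributed across DataNodes (paths are hypothetical).
hdfs("-put", "sales.csv", "/data/sales.csv")

# Raise the replication factor to 3 so each block is kept on three
# different nodes, protecting against data loss on node failure.
hdfs("-setrep", "3", "/data/sales.csv")

# List the directory to confirm the file landed in HDFS.
print(hdfs("-ls", "/data"))
```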
II. Apache Hive
Hive is open-source data warehouse software for reading, writing, and managing large datasets stored directly in HDFS or in other data storage systems.
Attributes:
- Queries have high latency => not suitable for applications that need fast response times.
- Hive is read-based => not suitable for write-heavy workloads.
- Best fit for data warehousing: ETL, reporting, and analysis.
- Easy access to data via SQL (see the query sketch after this list).
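The sketch below runs a typical batch-reporting query against Hive from Python using PyHive, a third-party client for HiveServer2. The host, port, and the sales table are assumptions for illustration; any HiveQL query would follow the same pattern.

```python
from pyhive import hive  # third-party client for HiveServer2

# Connect to a HiveServer2 instance; host, port, and database are
# assumptions for illustration.
conn = hive.connect(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# A typical batch-reporting query: Hive compiles the SQL into jobs
# over HDFS data, so expect high latency rather than interactivity.
cursor.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM sales GROUP BY region"
)
for region, total in cursor.fetchall():
    print(region, total)
```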
III. Apache Spark
Spark is a general-purpose data processing engine designed to extract and process large volumes of data for a wide range of applications.
- Interactive Analytics
- Stream Processing
- Machine Learning
- Data Integration
- ETL
Key attributes:
- Performs in-memory processing, which significantly increases the speed of computations.
- Provides interfaces for major programming languages such as Java, Scala, Python, R, and SQL.
- Can run using its standalone clustering technology
- Can also run on top of other infrastructures, such as Hadoop
- Can access data in a large variety of data sources, including HDFS and Hive.
- Processes streaming data fast.
- Performs complex analytics in real time (see the sketch after this list).
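The following is a minimal PySpark sketch tying these attributes together: it starts a local session, caches a dataset in memory, and queries it through SQL. The file path and column names are hypothetical; on a cluster, the master URL would point at a standalone, YARN, or Kubernetes deployment, and the path could be an hdfs:// URL.

```python
from pyspark.sql import SparkSession

# Start a local Spark session; on a cluster the master URL would
# point at a standalone, YARN, or Kubernetes deployment instead.
spark = (SparkSession.builder
         .appName("demo")
         .master("local[*]")
         .getOrCreate())

# Load a dataset (path and columns are hypothetical; Spark can also
# read straight from HDFS via an hdfs:// URL).
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# cache() keeps the data in memory across computations, which is
# where much of Spark's speed advantage comes from.
df.cache()

# The same data is reachable through the DataFrame API or plain SQL.
df.createOrReplaceTempView("sales")
spark.sql(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
).show()

spark.stop()
```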