Big Data processing technologies provide ways to work with large sets of structured, semi-structured, and unstructured data so that value can be derived from Big Data.
There are many tools for processing Big Data, but the best-known are:
- Apache Hadoop: a collection of tools that provides distributed storage and processing of Big Data.
- Apache Hive: a data warehouse for data query and analysis, built on top of Hadoop.
- Apache Spark: a distributed analytics framework for complex, real-time data analytics.
I. Apache Hadoop
1. Hadoop and its Benefits
Hadoop is Java-based software that provides distributed storage and processing of large datasets across clusters of computers.
Each computer in a cluster is a node: it can act as the NameNode (the master server that manages the file system) or as a DataNode (a worker that stores the data), and a group of nodes forms a cluster.
Hadoop provides a reliable, scalable and cost-effective solution for storing data with no format requirements.
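As a rough illustration of this master/worker layout, the sketch below shells out to Hadoop's command-line tools to print the cluster report the NameNode maintains, with one section per DataNode. It assumes a running Hadoop installation with the hdfs command on the PATH.

```python
import subprocess

# Ask the NameNode for a cluster report; assumes a running Hadoop
# installation with the `hdfs` command on the PATH.
report = subprocess.run(
    ["hdfs", "dfsadmin", "-report"],
    capture_output=True, text=True, check=True,
).stdout

# The report shows overall capacity plus one section per DataNode,
# reflecting the NameNode (master) / DataNode (worker) split.
print(report)
```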
Benefits:
- Better real-time data-driven decisions: Incorporates emerging data formats not traditionally used in data warehouses.
- Improved data access and analysis: Provides real-time, self-service access to stakeholders.
- Data offload and consolidation: Optimizes and streamlines costs by consolidating data, including cold data, across the organization.
2. Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS) is a storage system for big data that runs on clusters of commodity hardware connected through a network (a short usage sketch follows the lists below).
Main features:
- Provides scalable and reliable big data storage by partitioning files over multiple nodes.
- Splits large files across multiple computers, allowing parallel access to them.
- Replicates file blocks on different nodes to prevent data loss.
Other benefits:
- Fast recovery from hardware failures, because HDFS is built to detect faults and recover automatically.
- Access to streaming data, because HDFS supports high data throughput rates.
- Accommodation of large datasets, because HDFS can scale to hundreds of nodes in a single cluster.
- Portability, because HDFS is a Java application and is therefore portable across hardware platforms and compatible with many operating systems.
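The following is a minimal sketch of everyday HDFS usage from Python, shelling out to the standard hdfs dfs commands. It assumes a running cluster with the hdfs command on the PATH; the file paths and replication factor are illustrative only.

```python
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` subcommand and return its output."""
    result = subprocess.run(
        ["hdfs", "dfs", *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Copy a local file into HDFS; behind the scenes it is split into
# blocks and distributed across DataNodes (paths are hypothetical).
hdfs("-put", "sales.csv", "/data/sales.csv")

# Raise the replication factor to 3 so each block is kept on three
# different nodes, protecting against data loss on node failure.
hdfs("-setrep", "3", "/data/sales.csv")

# List the directory to confirm the file landed in HDFS.
print(hdfs("-ls", "/data"))
```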
II. Apache Hive
Hive is open-source data warehouse software for reading, writing, and managing large datasets stored directly in HDFS or in other data storage systems.
Attributes:
- Queries have high latency => not suitable for applications that need fast response times.
- Hive is read-based => not suitable for write-heavy workloads.
- Best fit for data warehousing: ETL, reporting, and analysis.
- Easy access to data via SQL (see the query sketch after this list).
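The sketch below runs a typical batch-reporting query against Hive from Python using PyHive, a third-party client for HiveServer2. The host, port, and the sales table are assumptions for illustration; any HiveQL query would follow the same pattern.

```python
from pyhive import hive  # third-party client for HiveServer2

# Connect to a HiveServer2 instance; host, port, and database are
# assumptions for illustration.
conn = hive.connect(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# A typical batch-reporting query: Hive compiles the SQL into jobs
# over HDFS data, so expect high latency rather than interactivity.
cursor.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM sales GROUP BY region"
)
for region, total in cursor.fetchall():
    print(region, total)
```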
III. Apache Spark
Spark is a general-purpose data processing engine designed to extract and process large volumes of data for a wide range of applications.
- Interactive Analytics
- Stream Processing
- Machine Learning
- Data Integration
- ETL
Key attributes:
- Performs in-memory processing, which significantly increases the speed of computations.
- Provides interfaces for major programming languages such as Java, Scala, Python, R, and SQL.
- Can run using its standalone clustering technology
- Can also run on top of other infrastructures, such as Hadoop
- Can access data in a large variety of data sources, including HDFS and Hive.
- Processes streaming data fast.
- Performs complex analytics in real time (see the sketch after this list).
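The following is a minimal PySpark sketch tying these attributes together: it starts a local session, caches a dataset in memory, and queries it through SQL. The file path and column names are hypothetical; on a cluster, the master URL would point at a standalone, YARN, or Kubernetes deployment, and the path could be an hdfs:// URL.

```python
from pyspark.sql import SparkSession

# Start a local Spark session; on a cluster the master URL would
# point at a standalone, YARN, or Kubernetes deployment instead.
spark = (SparkSession.builder
         .appName("demo")
         .master("local[*]")
         .getOrCreate())

# Load a dataset (path and columns are hypothetical; Spark can also
# read straight from HDFS via an hdfs:// URL).
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# cache() keeps the data in memory across computations, which is
# where much of Spark's speed advantage comes from.
df.cache()

# The same data is reachable through the DataFrame API or plain SQL.
df.createOrReplaceTempView("sales")
spark.sql(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
).show()

spark.stop()
```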