These are the processes used to move data from a data source to a destination.
I. ETL (Extract, Transform, Load)
1. Definition
An automated process that includes:
- Gathering raw data
- Extracting the information needed for reporting and analysis
- Cleaning, standardizing, and transforming the data into usable formats
- Loading the data into data repositories
a) Extract
Data is collected from the source for transformation.
Two types:
- Batch extraction: Data is collected and split into large chunks (called batches), then moved from source to destination at timed intervals. Example tools: Stitch, Blendo.
- Stream extraction: Data is collected and loaded into the data repository in real time, and is transformed while in transit. Tools: Apache Samza, Apache Kafka, Apache Storm.
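As a rough sketch of batch extraction, the snippet below pulls rows from a source database in fixed-size batches using only Python's standard library. The source.db file, orders table, and batch size are hypothetical; in practice a scheduler such as cron would run this at a timed interval.

```python
import sqlite3

def extract_in_batches(source_db, query, batch_size=1000):
    """Yield rows from the source in fixed-size batches (chunks)."""
    conn = sqlite3.connect(source_db)
    try:
        cursor = conn.execute(query)
        while True:
            batch = cursor.fetchmany(batch_size)
            if not batch:
                break
            yield batch  # each batch is moved to the destination as one chunk
    finally:
        conn.close()

# Hypothetical usage: assumes source.db contains an `orders` table.
for batch in extract_in_batches("source.db", "SELECT * FROM orders"):
    print(f"extracted a batch of {len(batch)} rows")
```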
b) Transform
Transformation involves executing rules and functions that convert raw data into a usable, purpose-ready form.
Examples:
- Removing duplicates
- Standardizing date and time formats
- Filtering out unnecessary data
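A minimal transformation sketch covering the three examples above, using pandas (assumed available); the column names and sample rows are hypothetical:

```python
import pandas as pd

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.drop_duplicates()                                   # remove duplicates
    df = df.assign(order_time=pd.to_datetime(df["order_time"]))  # standardize the time format
    df = df[df["amount"] > 0]                                    # filter out unnecessary rows
    return df

raw = pd.DataFrame({
    "order_time": ["2024-01-01 09:30", "2024-01-02 10:00", "2024-01-01 09:30"],
    "amount": [10.0, 0.0, 10.0],
})
print(transform(raw))  # one deduplicated, filtered row with a proper datetime column
```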
c) Load
Loading is the transport of processed data into a data repository. Types of loading:
- Initial loading: loading the entire dataset into the repository for the first time
- Incremental loading: applying updates and modifications periodically
- Full refresh: erasing the dataset and reloading it entirely
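The three loading types can be sketched against a toy SQLite repository (the table and rows are hypothetical; the upsert syntax requires SQLite 3.24 or later):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a data repository
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

def initial_load(rows):
    """Initial loading: load all the data into the repository."""
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

def incremental_load(rows):
    """Incremental loading: periodically apply only new or changed rows."""
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount",
        rows,
    )

def full_refresh(rows):
    """Full refresh: erase the dataset and reload it entirely."""
    conn.execute("DELETE FROM sales")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

initial_load([(1, 9.5), (2, 12.0)])
incremental_load([(2, 15.0), (3, 7.25)])  # updates id 2, inserts id 3
full_refresh([(1, 9.5)])                  # table now holds exactly one row
```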
2. Providers
- IBM InfoSphere
- AWS Glue
- Improvado
- Skyvia
II. ELT (Extract, Load, Transform)
1. Definition
ELT is a newer variant of ETL. The data is extracted first, then loaded into a data repository (most often a data lake, since the data has not yet been processed, though a data warehouse is sometimes used). Finally, the data is transformed into ready-to-use formats (a minimal sketch follows the advantages list below). The ELT process:
- Helps process large sets of unstructured and non-relational data
2. Advantages over ETL
- Shortens the cycle between extraction and delivery.
- Allows you to ingest large volumes of raw data as soon as the data becomes available
- Affords greater flexibility to Data Analysts and Data Scientists for exploratory data analysis (EDA)
- Transforms only the data required for a particular analysis, and therefore suits more use cases
- Suitable for Big Data
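A minimal ELT sketch, assuming a local directory stands in for a data lake and JSON-lines files for raw landed data: records are loaded untouched, and only the field a given analysis needs is transformed at read time.

```python
import json
import pathlib

LAKE = pathlib.Path("lake/raw")  # hypothetical data-lake directory

def extract_and_load(records):
    """E and L: land raw records in the lake with no upfront transformation."""
    LAKE.mkdir(parents=True, exist_ok=True)
    path = LAKE / "events.jsonl"
    with path.open("a") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return path

def transform_for_analysis(path, field):
    """T, on demand: pull out only the field this analysis requires."""
    with path.open() as f:
        return [json.loads(line)[field] for line in f if line.strip()]

path = extract_and_load([{"user": "a", "ms": 120}, {"user": "b", "ms": 340}])
print(transform_for_analysis(path, "ms"))  # [120, 340]
```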
III. Data Pipelines
Encompasses the entire journey of moving data from one system to another, including ETL and ELT processes. Data pipelines:
- Can be architected for both batch and streaming data
- Support long-running batch queries and smaller interactive queries
- Typically load data into a data lake, but can also load into other data destinations
A minimal pipeline sketch follows the provider list below.
Providers
- Apache Beam
- Apache Airflow
- Google Dataflow
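As one concrete example, here is a minimal Apache Beam pipeline in Python (requires the apache-beam package; the data and step names are hypothetical). The same pipeline code can be handed to either a batch or a streaming runner, which is what makes Beam suit both kinds of data.

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:  # defaults to the local DirectRunner
    (
        pipeline
        | "Extract" >> beam.Create(["a,1", "b,2", "c,3"])         # stand-in source
        | "Parse" >> beam.Map(lambda line: line.split(","))       # transform step
        | "Filter" >> beam.Filter(lambda fields: int(fields[1]) > 1)
        | "Load" >> beam.Map(print)                               # stand-in sink
    )
```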