These are the processes used to move data from a data source to a destination.
I. ETL (Extract, Transform, Load)
1. Definition
An automated process that includes:
- Gathering raw data
- Extracting the information needed for reporting and analysis
- Cleaning, standardizing, and transforming the data into usable formats
- Loading the data into data repositories
a) Extract
Data is collected from the source for transformation.
Two types:
- Batch extraction: Data is collected and split into large chunks (called batches), then moved from source to destination at timed intervals. Example tools: Stitch, Blendo.
- Stream extraction: Data is collected and loaded into the data repository in real time, and is transformed while in transit. Tools: Apache Samza, Apache Kafka, Apache Storm.
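As a rough sketch of batch extraction, the snippet below pulls rows from a source database in fixed-size batches using only Python's standard library. The source.db file, orders table, and batch size are hypothetical; in practice a scheduler such as cron would run this at a timed interval.

```python
import sqlite3

def extract_in_batches(source_db, query, batch_size=1000):
    """Yield rows from the source in fixed-size batches (chunks)."""
    conn = sqlite3.connect(source_db)
    try:
        cursor = conn.execute(query)
        while True:
            batch = cursor.fetchmany(batch_size)
            if not batch:
                break
            yield batch  # each batch is moved to the destination as one chunk
    finally:
        conn.close()

# Hypothetical usage: assumes source.db contains an `orders` table.
for batch in extract_in_batches("source.db", "SELECT * FROM orders"):
    print(f"extracted a batch of {len(batch)} rows")
```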
b) Transform
Transformation involves executing rules and functions that convert raw data into a usable, purpose-ready form.
Examples:
- Removing duplicates
- Standardizing date and time formats
- Filtering out unnecessary data
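A minimal transformation sketch covering the three examples above, using pandas (assumed available); the column names and sample rows are hypothetical:

```python
import pandas as pd

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.drop_duplicates()                                   # remove duplicates
    df = df.assign(order_time=pd.to_datetime(df["order_time"]))  # standardize the time format
    df = df[df["amount"] > 0]                                    # filter out unnecessary rows
    return df

raw = pd.DataFrame({
    "order_time": ["2024-01-01 09:30", "2024-01-02 10:00", "2024-01-01 09:30"],
    "amount": [10.0, 0.0, 10.0],
})
print(transform(raw))  # one deduplicated, filtered row with a proper datetime column
```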
c) Load
Loading is the transport of processed data into a data repository. Types of loading:
- Initial loading: loading the entire dataset into the repository for the first time
- Incremental loading: applying updates and modifications periodically
- Full refresh: erasing the dataset and reloading it entirely
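The three loading types can be sketched against a toy SQLite repository (the table and rows are hypothetical; the upsert syntax requires SQLite 3.24 or later):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a data repository
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

def initial_load(rows):
    """Initial loading: load all the data into the repository."""
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

def incremental_load(rows):
    """Incremental loading: periodically apply only new or changed rows."""
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount",
        rows,
    )

def full_refresh(rows):
    """Full refresh: erase the dataset and reload it entirely."""
    conn.execute("DELETE FROM sales")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

initial_load([(1, 9.5), (2, 12.0)])
incremental_load([(2, 15.0), (3, 7.25)])  # updates id 2, inserts id 3
full_refresh([(1, 9.5)])                  # table now holds exactly one row
```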
2. Providers
- IBM InfoSphere
- AWS Glue
- Improvado
- Skyvia
II. ELT (Extract, Load, Transform)
1. Definition
ELT is a newer variant of ETL. The data is extracted first, then loaded into a data repository (most often a data lake, since the data has not yet been processed, though a data warehouse is sometimes used). Finally, the data is transformed into ready-to-use formats (a minimal sketch follows the advantages list below). The ELT process:
- Helps process large sets of unstructured and non-relational data
2. Advantages over ETL
- Shortens the cycle between extraction and delivery.
- Allows you to ingest large volumes of raw data as soon as the data becomes available
- Affords greater flexibility to Data Analysts and Data Scientists for exploratory data analysis (EDA)
- Transforms only the data required for a particular analysis, and therefore suits more use cases
- Suitable for Big Data
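A minimal ELT sketch, assuming a local directory stands in for a data lake and JSON-lines files for raw landed data: records are loaded untouched, and only the field a given analysis needs is transformed at read time.

```python
import json
import pathlib

LAKE = pathlib.Path("lake/raw")  # hypothetical data-lake directory

def extract_and_load(records):
    """E and L: land raw records in the lake with no upfront transformation."""
    LAKE.mkdir(parents=True, exist_ok=True)
    path = LAKE / "events.jsonl"
    with path.open("a") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return path

def transform_for_analysis(path, field):
    """T, on demand: pull out only the field this analysis requires."""
    with path.open() as f:
        return [json.loads(line)[field] for line in f if line.strip()]

path = extract_and_load([{"user": "a", "ms": 120}, {"user": "b", "ms": 340}])
print(transform_for_analysis(path, "ms"))  # [120, 340]
```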
III. Data Pipelines
Encompasses the entire journey of moving data from one system to another, including ETL and ELT processes. Data pipelines:
- Can be architected for both batch and streaming data
- Support long-running batch queries and smaller interactive queries
- Typically load data into a data lake, but can also load into other data destinations
A minimal pipeline sketch follows the provider list below.
Providers
- Apache Beam
- Apache Airflow
- Google Dataflow
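As one concrete example, here is a minimal Apache Beam pipeline in Python (requires the apache-beam package; the data and step names are hypothetical). The same pipeline code can be handed to either a batch or a streaming runner, which is what makes Beam suit both kinds of data.

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:  # defaults to the local DirectRunner
    (
        pipeline
        | "Extract" >> beam.Create(["a,1", "b,2", "c,3"])         # stand-in source
        | "Parse" >> beam.Map(lambda line: line.split(","))       # transform step
        | "Filter" >> beam.Filter(lambda fields: int(fields[1]) > 1)
        | "Load" >> beam.Map(print)                               # stand-in sink
    )
```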