Ingest data from Database (Oracle) into the Hadoop Data Lake (HDFS+Hive) using the AVRO as the Data format
Created by @pkona
This pattern ingests all the tables present in an Oracle database into the Hadoop Data Lake. This pattern will achieve the following:
- Automatically creates Avro schemas and Hive tables
- Automatically shards large tables
- Ingests data into HDFS+Hive tables
- All the pipeline execution is done to maximize parallelized execution of the child pipelines across multiple nodes in the Snaplex
Pipelines in this Patter
Pattern 3 - Step 1.0 - Oracle to Hive Parent
The master pipeline :
-
Gets a list of all source tables.
-
Calls a child pipeline (Pattern 3 - Step 1.1 - Ingest - Generate Hive table Avro based on Oracle Table metadata) to create schemas and tables on Hive.
-
Calls a child pipeline for a table (Pattern 3 - Step 1.2 - Oracle to Hive No Sharding)
-
The Pipeline Execute controls the number of tables to be processed in parallel.
Pattern 3 - Step 1.1 - Ingest - Generate Hive table Avro based on Oracle Table metadata
- Creates schemas required for file data format
- Creates Hive table
- Returns folder location used by Hive Table
Uses pipelines Pattern 3 - Core - Write schema to HDFS and Write Hive Table DDL to HDFS
Pattern 3 - Step 1.2 - Oracle to Hive No Sharding
- Processes data for a single table
- If the source table is small, writes a single file.
- If table is large, calculate shard size and call child pipeline to process a shard
- The Pipeline Execute controls the number of shards to be processed in parallel (Pattern 3 - Step 1.2.1 Oracle to Hive - process a shard)
Pattern 3 - Step 1.2.1 Oracle to Hive - process a shard
- Processes a shard
- Queries source database for a shard
- Writes a file for the data
- Each shard has its own file under the Hive table’s folder
Configuration
Sources: Oracle
Targets: HDFS, Hive, AVRO
Snaps used: Oracle, HDFS, Hive, JSON Generator
Downloads
Project of all Pipelines
Community Patterns_DW_ DB to Hadoop DL.zip (11.6 KB)
Individual Pipelines
Pattern 3 - Step 1.0 - Oracle to Hive Parent_2018_08_24.slp (14.7 KB)
Pattern 3 - Step 1.1 - Ingest - Generate Hive table Avro based on Oracle Table metadata_2018_08_24.slp (22.0 KB)
Pattern 3 - Step 1.2 - Oracle to Hive No Sharding_2018_08_24.slp (15.3 KB)
Pattern 3 - Step 1.2.1 Oracle to Hive - process a shard_2018_08_24.slp (9.8 KB)
Pattern - Core - Query Hive Table Metadata_2018_08_24.slp (5.1 KB)
Pattern 3 - Core - Write schema to HDFS_2018_08_24.slp (8.4 KB)