Ingest data from Database (Oracle) into the Hadoop Data Lake (HDFS+Hive) using the AVRO as the Data format

Question

Created by @pkona

This pattern ingests all the tables present in an Oracle database into the Hadoop Data Lake. This pattern will achieve the following:

Automatically creates Avro schemas and Hive tables
Automatically shards large tables
Ingests data into HDFS+Hive tables
All the pipeline execution is done to maximize parallelized execution of the child pipelines across multiple nodes in the Snaplex

Pipelines in this Patter

Pattern 3 - Step 1.0 - Oracle to Hive Parent

The master pipeline :

Gets a list of all source tables.

Calls a child pipeline (Pattern 3 - Step 1.1 - Ingest - Generate Hive table Avro based on Oracle Table metadata) to create schemas and tables on Hive.

Calls a child pipeline for a table (Pattern 3 - Step 1.2 - Oracle to Hive No Sharding)

The Pipeline Execute controls the number of tables to be processed in parallel.

Pattern 3 - Step 1.1 - Ingest - Generate Hive table Avro based on Oracle Table metadata

Creates schemas required for file data format
Creates Hive table
Returns folder location used by Hive Table
Uses pipelines Pattern 3 - Core  - Write schema to HDFS and Write Hive Table DDL to HDFS

Pattern 3 - Step 1.2 - Oracle to Hive No Sharding

Processes data for a single table

If the source table is small, writes a single file.
If table is large, calculate shard size and call child pipeline to process a shard
The Pipeline Execute controls the number of shards to be processed in parallel (Pattern 3 - Step 1.2.1 Oracle to Hive - process a shard)

Pattern 3 - Step 1.2.1 Oracle to Hive - process a shard

Processes a shard
Queries source database for a shard
Writes a file for the data
Each shard has its own file under the Hive table’s folder

Configuration
Sources: Oracle
Targets: HDFS, Hive, AVRO
Snaps used: Oracle, HDFS, Hive, JSON Generator

Downloads

Project of all Pipelines
Community Patterns_DW_ DB to Hadoop DL.zip (11.6 KB)

Individual Pipelines
Pattern 3 - Step 1.0 - Oracle to Hive Parent_2018_08_24.slp (14.7 KB)
Pattern 3 - Step 1.1 - Ingest - Generate Hive table Avro based on Oracle Table metadata_2018_08_24.slp (22.0 KB)
Pattern 3 - Step 1.2 - Oracle to Hive No Sharding_2018_08_24.slp (15.3 KB)
Pattern 3 - Step 1.2.1 Oracle to Hive - process a shard_2018_08_24.slp (9.8 KB)
Pattern - Core - Query Hive Table Metadata_2018_08_24.slp (5.1 KB)
Pattern 3 - Core - Write schema to HDFS_2018_08_24.slp (8.4 KB)

eric_sou · Answer

Hi Prasad Kona, this is a great pattern.
To me this is the initial set up to get all your data (on the run time) into your target environment.
Now, how would you ensure that your data gets constantly updated (missing records added, data altered in Oracle DB etc.)
This is the part I am struggling with, keep the data sync from Oracle to Hadoop.
Thanks,

dmiller · Answer

Prasad is no longer in this community. Perhaps someone else who has used this pattern can provide further insight.

Forum Discussion

Ingest data from Database (Oracle) into the Hadoop Data Lake (HDFS+Hive) using the AVRO as the Data format

Pipelines in this Patter

Pattern 3 - Step 1.0 - Oracle to Hive Parent

Pattern 3 - Step 1.1 - Ingest - Generate Hive table Avro based on Oracle Table metadata

Pattern 3 - Step 1.2 - Oracle to Hive No Sharding

Pattern 3 - Step 1.2.1 Oracle to Hive - process a shard

Configuration

Downloads

Project of all Pipelines

Individual Pipelines

2 Replies

Recent Discussions

File Extraction Patterns

Secret Management vs Enhanced Account Encryption

sequence month of dates

Flatten JSON files into CSV files

Filter out the specific column