Ingest data from a database (Oracle) into the Hadoop Data Lake (HDFS + Hive) using Avro as the data format
‎08-24-2018 10:43 AM
Created by @pkona
This pattern ingests all the tables present in an Oracle database into the Hadoop Data Lake. This pattern will achieve the following:
- Automatically creates Avro schemas and Hive tables
- Automatically shards large tables
- Ingests data into HDFS+Hive tables
- Maximizes parallel execution of the child pipelines across multiple nodes in the Snaplex
Pipelines in this Pattern
Pattern 3 - Step 1.0 - Oracle to Hive Parent
The master pipeline:
- Gets a list of all source tables.
- Calls a child pipeline (Pattern 3 - Step 1.1 - Ingest - Generate Hive table Avro based on Oracle Table metadata) to create the Avro schemas and tables in Hive.
- Calls a child pipeline for each table (Pattern 3 - Step 1.2 - Oracle to Hive No Sharding).
- The Pipeline Execute Snap controls the number of tables processed in parallel (a sketch of this orchestration follows below).
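For readers who want to see the orchestration outside SnapLogic, here is a minimal Python sketch of what the parent pipeline does conceptually: list the tables of a source schema from Oracle's data dictionary, then fan the per-table work out with bounded parallelism, much like the Pipeline Execute Snap's pool size. The schema name, connection details, and the `ingest_table` placeholder are illustrative assumptions, not part of the downloadable pipelines.

```python
# Minimal sketch of the parent-pipeline logic: list source tables, then
# process them with bounded parallelism (analogous to Pipeline Execute's pool size).
# Connection settings and ingest_table() are placeholders, not part of the pattern.
from concurrent.futures import ThreadPoolExecutor

import oracledb  # pip install oracledb

SOURCE_SCHEMA = "SALES"          # hypothetical source schema
MAX_PARALLEL_TABLES = 4          # mirrors the Pipeline Execute pool size

def list_source_tables(conn, owner):
    """Return all table names owned by the given schema (Oracle data dictionary)."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT table_name FROM all_tables WHERE owner = :owner",
            owner=owner,
        )
        return [row[0] for row in cur.fetchall()]

def ingest_table(table_name):
    """Placeholder for the per-table child pipelines (Step 1.1 + Step 1.2)."""
    print(f"would create Hive table and ingest {SOURCE_SCHEMA}.{table_name}")

if __name__ == "__main__":
    conn = oracledb.connect(user="etl_user", password="***", dsn="dbhost/ORCLPDB1")
    tables = list_source_tables(conn, SOURCE_SCHEMA)
    # Fan out over tables, a few at a time, like the parent pipeline does.
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_TABLES) as pool:
        pool.map(ingest_table, tables)
    conn.close()
```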
Pattern 3 - Step 1.1 - Ingest - Generate Hive table Avro based on Oracle Table metadata
- Creates the Avro schema required for the file data format
- Creates the Hive table
- Returns the HDFS folder location used by the Hive table
Uses the pipelines Pattern 3 - Core - Write schema to HDFS and Write Hive Table DDL to HDFS (a sketch of the idea follows below).
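Conceptually, Step 1.1 turns Oracle column metadata into two artifacts: an Avro schema file and a Hive DDL statement pointing at the table's HDFS folder. The sketch below illustrates that idea; the type mapping is deliberately simplified, and the database, paths, and table names are hypothetical.

```python
# Sketch of Step 1.1's idea: map Oracle column metadata to an Avro schema and a
# Hive DDL statement. The type map is simplified; the real pattern is built from
# SnapLogic Snaps, not this code.
import json

# Simplified Oracle -> Avro type mapping (assumption, not exhaustive).
ORACLE_TO_AVRO = {
    "VARCHAR2": "string",
    "CHAR": "string",
    "NUMBER": "double",
    "DATE": "string",
    "TIMESTAMP": "string",
}

def avro_schema(table, columns):
    """columns: list of (column_name, oracle_data_type), e.g. from ALL_TAB_COLUMNS."""
    fields = [
        # Nullable union so missing values don't break the writer.
        {"name": name.lower(), "type": ["null", ORACLE_TO_AVRO.get(dtype, "string")]}
        for name, dtype in columns
    ]
    return {"type": "record", "name": table.lower(), "fields": fields}

def hive_ddl(db, table, schema_hdfs_path, data_hdfs_path):
    """External Avro-backed Hive table pointing at the folder the data files land in."""
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {db}.{table.lower()}\n"
        f"STORED AS AVRO\n"
        f"LOCATION '{data_hdfs_path}'\n"
        f"TBLPROPERTIES ('avro.schema.url'='{schema_hdfs_path}')"
    )

# Example with hypothetical metadata for a CUSTOMERS table:
cols = [("CUSTOMER_ID", "NUMBER"), ("NAME", "VARCHAR2"), ("CREATED_AT", "DATE")]
print(json.dumps(avro_schema("CUSTOMERS", cols), indent=2))
print(hive_ddl("datalake", "CUSTOMERS",
               "hdfs:///warehouse/schemas/customers.avsc",
               "hdfs:///warehouse/data/customers"))
```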
Pattern 3 - Step 1.2 - Oracle to Hive No Sharding
- Processes data for a single table
- If the source table is small, writes a single file.
- If the table is large, calculates the shard size and calls a child pipeline for each shard (see the shard-range sketch below).
- The Pipeline Execute Snap controls the number of shards processed in parallel (Pattern 3 - Step 1.2.1 Oracle to Hive - process a shard).
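The sharding decision can be illustrated with a short sketch: if the row count stays under a threshold, the table is written as one file; otherwise the numeric key's MIN..MAX range is split into roughly equal shards that child pipelines can process in parallel. The thresholds and the choice of shard key are assumptions for illustration; the actual logic lives in the pipeline's expressions.

```python
# Sketch of Step 1.2's branch logic: small tables become a single file,
# large tables are split into key-range shards that child pipelines process
# in parallel. Thresholds and the shard key are illustrative assumptions.
SMALL_TABLE_ROWS = 1_000_000      # below this, write one file
TARGET_ROWS_PER_SHARD = 250_000   # rough shard size for large tables

def shard_ranges(row_count, key_min, key_max):
    """Return (lo, hi) key ranges; a single range means 'no sharding'."""
    if row_count <= SMALL_TABLE_ROWS:
        return [(key_min, key_max)]
    num_shards = -(-row_count // TARGET_ROWS_PER_SHARD)   # ceiling division
    width = (key_max - key_min + 1) / num_shards
    ranges = []
    for i in range(num_shards):
        lo = int(key_min + i * width)
        hi = key_max if i == num_shards - 1 else int(key_min + (i + 1) * width) - 1
        ranges.append((lo, hi))
    return ranges

# Example: 2.3M rows with ORDER_ID spanning 1..2_300_000 -> 10 shards.
for lo, hi in shard_ranges(2_300_000, 1, 2_300_000):
    print(f"shard: ORDER_ID BETWEEN {lo} AND {hi}")
```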
Pattern 3 - Step 1.2.1 Oracle to Hive - process a shard
- Processes a single shard
- Queries the source database for the shard's key range
- Writes the shard's data to a file
- Each shard has its own file under the Hive table's folder (see the sketch below)
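A per-shard worker boils down to a bounded query plus an Avro write. The sketch below assumes the Avro schema dictionary from the earlier sketch, uses the fastavro and python-oracledb libraries, and writes the shard file locally before it would be pushed into the Hive table's HDFS folder (for example with `hdfs dfs -put`); every name here is illustrative, not the pattern's actual implementation.

```python
# Sketch of Step 1.2.1: query one key-range shard from Oracle and write it as an
# Avro file. Each shard contributes one file under the Hive table's folder.
import datetime
import decimal

import fastavro   # pip install fastavro

def _coerce(value):
    """Make Oracle values Avro-friendly: dates -> ISO strings, NUMBER -> float."""
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    if isinstance(value, decimal.Decimal):
        return float(value)
    return value

def process_shard(conn, schema_dict, table, key_col, lo, hi, out_dir):
    """Query one key-range shard and write it as a single Avro file."""
    parsed = fastavro.parse_schema(schema_dict)
    field_names = [f["name"] for f in schema_dict["fields"]]
    sql = (
        f"SELECT {', '.join(n.upper() for n in field_names)} "
        f"FROM {table} WHERE {key_col} BETWEEN :lo AND :hi"
    )
    with conn.cursor() as cur:
        cur.execute(sql, lo=lo, hi=hi)
        records = [
            {name: _coerce(value) for name, value in zip(field_names, row)}
            for row in cur.fetchall()
        ]
    # One file per shard, named after its key range, destined for the table's folder.
    out_path = f"{out_dir}/{table.lower()}_{lo}_{hi}.avro"
    with open(out_path, "wb") as out:
        fastavro.writer(out, parsed, records)
    return out_path
```

Running this once per (lo, hi) pair from the previous sketch yields one file per shard, and the external Avro-backed Hive table defined in Step 1.1 reads those files directly from the table's folder.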
Configuration
Sources: Oracle
Targets: HDFS, Hive, AVRO
Snaps used: Oracle, HDFS, Hive, JSON Generator
Downloads
Project of all Pipelines
Community Patterns_DW_ DB to Hadoop DL.zip (11.6 KB)
Individual Pipelines
Pattern 3 - Step 1.0 - Oracle to Hive Parent_2018_08_24.slp (14.7 KB)
Pattern 3 - Step 1.1 - Ingest - Generate Hive table Avro based on Oracle Table metadata_2018_08_24.s... (22.0 KB)
Pattern 3 - Step 1.2 - Oracle to Hive No Sharding_2018_08_24.slp (15.3 KB)
Pattern 3 - Step 1.2.1 Oracle to Hive - process a shard_2018_08_24.slp (9.8 KB)
Pattern - Core - Query Hive Table Metadata_2018_08_24.slp (5.1 KB)
Pattern 3 - Core - Write schema to HDFS_2018_08_24.slp (8.4 KB)
‎06-02-2020 07:08 AM
Hi Prasad Kona, this is a great pattern.
To me, this is the initial setup to get all of your data into the target environment at run time.
Now, how would you ensure that the data gets constantly updated (missing records added, data altered in the Oracle DB, etc.)?
This is the part I am struggling with: keeping the data in sync from Oracle to Hadoop.
Thanks,
‎06-02-2020 07:42 AM
Prasad is no longer in this community. Perhaps someone else who has used this pattern can provide further insight.
Diane Miller
