Ingest data from a database (Oracle) into the Hadoop Data Lake (HDFS+Hive) using Avro as the data format

pkona
Former Employee

Created by @pkona


This pattern ingests all of the tables in an Oracle database into the Hadoop Data Lake. It does the following:

  1. Automatically creates Avro schemas and Hive tables (see the example artifacts sketched after this list)
  2. Automatically shards large tables
  3. Ingests data into HDFS+Hive tables
  4. Executes the child pipelines in parallel across multiple nodes in the Snaplex to maximize throughput
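To make the first item concrete, here is a rough sketch of the kind of artifacts the pattern generates for one table. The CUSTOMERS table, its columns, and the HDFS paths are hypothetical examples, not taken from the downloads below; the actual pipelines build these with the Oracle, HDFS, and Hive Snaps.

```python
import json

# Hypothetical Avro schema generated for an Oracle table named CUSTOMERS
# (column names and types are illustrative only).
customers_schema = {
    "type": "record",
    "name": "CUSTOMERS",
    "namespace": "oracle.ingest",
    "fields": [
        {"name": "CUSTOMER_ID", "type": ["null", "long"],   "default": None},
        {"name": "NAME",        "type": ["null", "string"], "default": None},
        {"name": "CREATED_AT",  "type": ["null", "string"], "default": None},
    ],
}

# Matching Hive external table, stored as Avro and pointed at the HDFS folder
# that the ingest pipelines write data files into.
customers_ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS customers
STORED AS AVRO
LOCATION '/data/lake/customers'
TBLPROPERTIES ('avro.schema.url'='hdfs:///data/lake/schemas/customers.avsc')
"""

print(json.dumps(customers_schema, indent=2))
print(customers_ddl)
```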

Pipelines in this Pattern

Pattern 3 - Step 1.0 - Oracle to Hive Parent

The master pipeline:

  • Gets a list of all source tables.

  • Calls a child pipeline (Pattern 3 - Step 1.1 - Ingest - Generate Hive table Avro based on Oracle Table metadata) to create schemas and tables on Hive.

  • Calls a child pipeline (Pattern 3 - Step 1.2 - Oracle to Hive No Sharding) for each table.

  • The Pipeline Execute snap controls the number of tables processed in parallel (see the sketch after this list).
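The fan-out itself is done with Pipeline Execute, but the underlying idea is simply a bounded worker pool over the table list. A minimal sketch of that idea in Python (the table names, pool size, and function bodies are placeholders, not the pattern's implementation):

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder stand-ins for the child pipelines; in the pattern these are
# Pipeline Execute calls to Step 1.1 and Step 1.2, not Python functions.
def create_hive_table(table_name: str) -> str:
    """Step 1.1: create the Avro schema and Hive table, return its HDFS folder."""
    return f"/data/lake/{table_name.lower()}"

def ingest_table(table_name: str) -> None:
    """Step 1.2: load one table's rows into its HDFS folder."""
    print(f"ingesting {table_name}")

source_tables = ["CUSTOMERS", "ORDERS", "ORDER_ITEMS"]  # list read from the Oracle catalog

# The bounded pool mirrors the parallelism setting on Pipeline Execute: only
# this many tables are in flight at once, the rest queue up behind them.
with ThreadPoolExecutor(max_workers=4) as pool:
    folders = list(pool.map(create_hive_table, source_tables))  # HDFS folder per table
    list(pool.map(ingest_table, source_tables))
```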

Pattern 3 - Step 1.1 - Ingest - Generate Hive table Avro based on Oracle Table metadata


  • Creates the schemas required for the file data format (see the sketch after this list)
  • Creates the Hive table
  • Returns the folder location used by the Hive table
  • Uses the pipelines Pattern 3 - Core - Write schema to HDFS and Write Hive Table DDL to HDFS
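The heart of this step is mapping Oracle column metadata to Avro field types. Below is a minimal sketch of that mapping, assuming the column list comes from a query against ALL_TAB_COLUMNS; the type map is deliberately simplified, and the actual pipelines do this work with the Snaps listed under Configuration.

```python
# Simplified Oracle-to-Avro type map; real NUMBER precision/scale handling
# would be more involved (e.g. distinguishing integers from decimals).
ORACLE_TO_AVRO = {
    "NUMBER": "double",
    "VARCHAR2": "string",
    "CHAR": "string",
    "DATE": "string",
    "TIMESTAMP": "string",
    "CLOB": "string",
}

def avro_schema_from_metadata(table: str, columns: list[tuple[str, str]]) -> dict:
    """Build an Avro record schema from (column_name, oracle_type) pairs,
    e.g. the result of:
      SELECT column_name, data_type FROM all_tab_columns WHERE table_name = :t
    Every field is nullable so that NULLs in the source survive the conversion."""
    fields = [
        {"name": name,
         "type": ["null", ORACLE_TO_AVRO.get(oracle_type, "string")],
         "default": None}
        for name, oracle_type in columns
    ]
    return {"type": "record", "name": table, "fields": fields}

# Example with made-up columns
print(avro_schema_from_metadata("CUSTOMERS",
                                [("CUSTOMER_ID", "NUMBER"), ("NAME", "VARCHAR2")]))
```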

Pattern 3 - Step 1.2 - Oracle to Hive No Sharding


  • Processes data for a single table:
    • If the source table is small, writes a single file.
    • If the table is large, calculates the shard size and calls a child pipeline to process each shard (see the sketch after this list).
    • The Pipeline Execute snap controls the number of shards processed in parallel (Pattern 3 - Step 1.2.1 Oracle to Hive - process a shard).
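Shard planning reduces to splitting the table's key range into roughly equal slices. A minimal sketch of that calculation, assuming a numeric key whose min, max, and row count are known; the threshold, function name, and arithmetic are illustrative, not the pipeline's exact expressions.

```python
def plan_shards(min_id: int, max_id: int, row_count: int,
                rows_per_shard: int = 1_000_000) -> list[tuple[int, int]]:
    """Split a table's key range into roughly equal shards.

    Illustrative only: the pipeline derives comparable boundaries from the
    source table and hands each range to the child pipeline
    (Pattern 3 - Step 1.2.1) through Pipeline Execute.
    """
    if row_count <= rows_per_shard:
        return [(min_id, max_id)]                    # small table: one file, no sharding
    shard_count = -(-row_count // rows_per_shard)    # ceiling division
    step = (max_id - min_id + 1) // shard_count + 1
    return [(lo, min(lo + step - 1, max_id))
            for lo in range(min_id, max_id + 1, step)]

# Example: 3.5M rows with ids 1..3,500,000 -> 4 shards of roughly 875k ids each
print(plan_shards(1, 3_500_000, 3_500_000))
```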

Pattern 3 - Step 1.2.1 Oracle to Hive - process a shard


  • Processes a single shard
  • Queries the source database for the shard's rows
  • Writes one file for that data
  • Each shard has its own file under the Hive table's folder (see the sketch below)
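Per shard, the work is one range query and one output file. Here is a minimal sketch, assuming a DB-API connection to Oracle (for example from python-oracledb) and the fastavro library; in the actual pattern the Oracle and HDFS Snaps do this work, and out_path would be a location under the Hive table's HDFS folder.

```python
from fastavro import writer, parse_schema

def process_shard(conn, table: str, key_col: str, lo: int, hi: int,
                  schema: dict, out_path: str) -> None:
    """Query one key range and write it as one Avro file.

    Hypothetical helper: `conn` is any DB-API connection to Oracle, and the
    row values may need converting (DATE, NUMBER) to match the Avro types.
    Because each shard lands as its own file in the table's folder, Hive sees
    the new data without any further DDL.
    """
    cur = conn.cursor()
    cur.execute(
        f"SELECT * FROM {table} WHERE {key_col} BETWEEN :lo AND :hi",
        {"lo": lo, "hi": hi},
    )
    cols = [d[0] for d in cur.description]
    records = (dict(zip(cols, row)) for row in cur)
    with open(out_path, "wb") as out:
        writer(out, parse_schema(schema), records)
```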

Configuration

Sources: Oracle
Targets: HDFS, Hive (data stored as Avro)
Snaps used: Oracle, HDFS, Hive, JSON Generator

Downloads

Project of all Pipelines

Community Patterns_DW_ DB to Hadoop DL.zip (11.6 KB)

Individual Pipelines

Pattern 3 - Step 1.0 - Oracle to Hive Parent_2018_08_24.slp (14.7 KB)
Pattern 3 - Step 1.1 - Ingest - Generate Hive table Avro based on Oracle Table metadata_2018_08_24.s... (22.0 KB)
Pattern 3 - Step 1.2 - Oracle to Hive No Sharding_2018_08_24.slp (15.3 KB)
Pattern 3 - Step 1.2.1 Oracle to Hive - process a shard_2018_08_24.slp (9.8 KB)
Pattern - Core - Query Hive Table Metadata_2018_08_24.slp (5.1 KB)
Pattern 3 - Core - Write schema to HDFS_2018_08_24.slp (8.4 KB)

2 REPLIES

eric_sou
New Contributor III

Hi Prasad Kona, this is a great pattern.
To me, this is the initial setup to get all of your data (at run time) into your target environment.
Now, how would you ensure that the data stays constantly up to date (missing records added, data altered in the Oracle DB, etc.)?
That is the part I am struggling with: keeping the data in sync from Oracle to Hadoop.
Thanks,

dmiller
Former Employee

Prasad is no longer in this community. Perhaps someone else who has used this pattern can provide further insight.


Diane Miller