Ingest data from Database (Oracle) into the Hadoop Data Lake (HDFS+Hive) using the AVRO as the Data format

Created by @pkona


This pattern ingests all the tables present in an Oracle database into the Hadoop Data Lake. This pattern will achieve the following:

  1. Automatically creates Avro schemas and Hive tables
  2. Automatically shards large tables
  3. Ingests data into HDFS+Hive tables
  4. All the pipeline execution is done to maximize parallelized execution of the child pipelines across multiple nodes in the Snaplex

Pipelines in this Patter

Pattern 3 - Step 1.0 - Oracle to Hive Parent


The master pipeline :

  • Gets a list of all source tables.

  • Calls a child pipeline (Pattern 3 - Step 1.1 - Ingest - Generate Hive table Avro based on Oracle Table metadata) to create schemas and tables on Hive.

  • Calls a child pipeline for a table (Pattern 3 - Step 1.2 - Oracle to Hive No Sharding)

  • The Pipeline Execute controls the number of tables to be processed in parallel.

Pattern 3 - Step 1.1 - Ingest - Generate Hive table Avro based on Oracle Table metadata

  • Creates schemas required for file data format
  • Creates Hive table
  • Returns folder location used by Hive Table
    Uses pipelines Pattern 3 - Core - Write schema to HDFS and Write Hive Table DDL to HDFS

Pattern 3 - Step 1.2 - Oracle to Hive No Sharding

  • Processes data for a single table
    • If the source table is small, writes a single file.
    • If table is large, calculate shard size and call child pipeline to process a shard
    • The Pipeline Execute controls the number of shards to be processed in parallel (Pattern 3 - Step 1.2.1 Oracle to Hive - process a shard)

Pattern 3 - Step 1.2.1 Oracle to Hive - process a shard

  • Processes a shard
  • Queries source database for a shard
  • Writes a file for the data
  • Each shard has its own file under the Hive table’s folder

Configuration

Sources: Oracle
Targets: HDFS, Hive, AVRO
Snaps used: Oracle, HDFS, Hive, JSON Generator

Downloads

Project of all Pipelines

Community Patterns_DW_ DB to Hadoop DL.zip (11.6 KB)

Individual Pipelines

Pattern 3 - Step 1.0 - Oracle to Hive Parent_2018_08_24.slp (14.7 KB)
Pattern 3 - Step 1.1 - Ingest - Generate Hive table Avro based on Oracle Table metadata_2018_08_24.slp (22.0 KB)
Pattern 3 - Step 1.2 - Oracle to Hive No Sharding_2018_08_24.slp (15.3 KB)
Pattern 3 - Step 1.2.1 Oracle to Hive - process a shard_2018_08_24.slp (9.8 KB)
Pattern - Core - Query Hive Table Metadata_2018_08_24.slp (5.1 KB)
Pattern 3 - Core - Write schema to HDFS_2018_08_24.slp (8.4 KB)