Forum Discussion
kumar25 - Do you just need something like the following?
input-1 line1
input-2 line1
input-3 line1
input-1 line2
...etc?
I've attached an example pipeline showing how to accomplish this. Basically, it adds fields to track which input each record comes from and the record number of each record. Once the data is combined using Union, sort the data on the record number and input view. Finally, remove the temporary fields used for this process. Note the use of the "Passthrough" option in the Mappers so I don't need to know what the record layout is.
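In case a sketch outside SnapLogic helps, here's a minimal Python version of the same logic; the input view names and sample records are made up for illustration:

```python
from operator import itemgetter

# Hypothetical stand-ins for the pipeline's input views.
input_views = {
    "input-1": [{"value": "input-1 line1"}, {"value": "input-1 line2"}],
    "input-2": [{"value": "input-2 line1"}, {"value": "input-2 line2"}],
    "input-3": [{"value": "input-3 line1"}, {"value": "input-3 line2"}],
}

# Mapper step: tag each record with its source view and record number,
# passing the original fields through untouched.
tagged = []
for view_name, records in input_views.items():
    for record_number, record in enumerate(records):
        tagged.append({**record, "_view": view_name, "_recno": record_number})

# Union + Sort step: combine everything, then order by record number
# first and input view second to interleave the streams.
tagged.sort(key=itemgetter("_recno", "_view"))

# Final Mapper step: strip the temporary tracking fields.
result = [{k: v for k, v in rec.items() if k not in ("_view", "_recno")}
          for rec in tagged]

for rec in result:
    print(rec["value"])
# Prints: input-1 line1, input-2 line1, input-3 line1, input-1 line2, ...
```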
Hope this helps!
PS - I appreciate Aleksandar_A's contribution; however, I do warn against using the Gate snap with the "All input documents" setting. If your input dataset is large, it can consume considerable resources on your execution nodes, causing other pipelines to pause while they wait on resources, or, in the worst case, crash the node depending on other activity. Gate is a powerful snap and can be used very effectively; just remember that with this setting it has to hold every input document before producing any output.
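To make the resource concern concrete, here is a rough Python sketch (not SnapLogic code; the document count and function names are invented for illustration) contrasting a gate-style collect-everything step with a streaming one:

```python
# Hypothetical stand-in for a large upstream dataset.
def large_input(n):
    for i in range(n):
        yield {"id": i, "payload": "x" * 1024}  # ~1 KB per document

# Gate-like "All input documents": everything is resident at once,
# so memory grows linearly with the number of documents.
def gate_all(docs):
    return {"rows": list(docs)}

# Streaming alternative: only one document is resident at a time.
def stream(docs):
    for doc in docs:
        yield doc

buffered = gate_all(large_input(100_000))   # roughly 100 MB held at once
for doc in stream(large_input(100_000)):    # constant memory, doc by doc
    pass
```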
Hi Prasad Kona, this is a great pattern.
To me, this is the initial setup to get all your data (at run time) into your target environment.
Now, how would you ensure that the data stays constantly up to date (missing records added, data altered in the Oracle DB, etc.)?
This is the part I am struggling with: keeping the data in sync from Oracle to Hadoop.
Thanks,