
Comparing the schema and discarding the object

mohit_jain
New Contributor III

Hi Team,

I have a scenario that I need to solve as soon as possible.
I have two files: an initial load and an incremental load.

We have loaded the initial-load data into the DB, which contains columns like: id, name, age, class

mohit_jain_0-1691659876373.png

We receive incremental files on a daily basis, and in some cases there is an extra column whose values we have to discard, while alerting the team that a new column was added. How can we do this in the pipeline?
Incremental file columns:

mohit_jain_1-1691660036331.png

We have to send an alert that the column roll_number was added, and also discard that column. We can't do this manually because it is a daily job; can we use a Mapper Snap in this case?

Can anyone help me with this?

Thanks in advance.
@koryknick @dmiller 

1 ACCEPTED SOLUTION

@mohit_jain the solution is very straightforward:
1. Get the header from your initial file.
2. Merge it with your incremental file.
3. Filter matched and unmatched columns into two different output streams.
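For readers without the attached pipeline, the header comparison behind these three steps can be sketched outside SnapLogic. This is a minimal Python sketch, assuming both loads are CSV files with a header row; the function name and file layout are illustrative, not part of the actual solution pipeline:

```python
import csv

def split_columns(initial_path, incremental_path):
    """Compare the headers of two CSV files and split the incremental
    file's columns into matched (known) and unmatched (new) lists."""
    with open(initial_path, newline="") as f:
        known = set(next(csv.reader(f)))      # header of the initial load
    with open(incremental_path, newline="") as f:
        incoming = next(csv.reader(f))        # header of the incremental file
    matched = [c for c in incoming if c in known]
    unmatched = [c for c in incoming if c not in known]
    return matched, unmatched
```

The `unmatched` list feeds the alert branch, while `matched` drives which columns are kept for loading.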

Download the solution pipeline Compare_Schema_Discard_load; the screenshot is below.

Screenshot 2023-08-17 131936.png


14 REPLIES

Hi @darshthakkar 

We cannot do a truncate-and-load on a daily basis. The only requirement is to detect any extra column we may receive in the future, handle it, and send a notification that we got an extra column.

Thanks 

Got it. You'll have to use a couple of Snaps to design this.

Have a defined payload coming from the source and check for any additional columns (this can be achieved via a Filter/Router Snap). If there is no change, the data can go to the downstream system for ingestion.

If there is a change, copy the output into two different streams: one to notify the group that a new column was found in the payload, and the other for downstream ingestion, using an ALTER TABLE query to add the column on top of ingesting data for the existing columns.

This use case will require a lot of testing on your side, and it can also be achieved with a parent-child pipeline. I would urge you to test it without parent-child first (if you've never used it before) and then replicate the functionality in parent-child, so that you know where and how the data is flowing.
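The routing decision described above can be sketched in plain Python. This is only an illustration of the branch logic, assuming each incoming record is a dict of column/value pairs; `target_table`, the VARCHAR type, and the function name are hypothetical placeholders, not taken from the thread:

```python
def route_document(doc, known_columns):
    """Route one incoming record: return the record trimmed to known
    columns, plus ALTER TABLE statements and alert messages for any
    newly seen columns."""
    new_cols = [k for k in doc if k not in known_columns]
    alerts = [f"New column detected: {c}" for c in new_cols]
    # Hypothetical table name and type; in a real pipeline these
    # would come from configuration.
    ddl = [f"ALTER TABLE target_table ADD COLUMN {c} VARCHAR" for c in new_cols]
    clean = {k: v for k, v in doc.items() if k in known_columns}
    return clean, ddl, alerts
```

The `clean` record goes to the ingestion stream, while `ddl` and `alerts` feed the schema-change and notification streams.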

Thanks.

Hi @mohit_jain ,

Attached is a POC pipeline that checks parameter-based column names against the streaming document keys.
canvas_2023_08_13.slp 

alchemiz_1-1691861516475.png

** Set "Reuse" checked in the Pipeline Execute Snap

Thanks,
EmEm

mohit_jain
New Contributor III

Hi @alchemiz 

Thanks for your reply.
The only issue is that we are not getting the headers from a pipeline parameter (there are no parent and child pipelines here). We have two files:
1. Initial file - already dropped into the SFTP location.
2. Incremental file - arrives on a daily basis.
We have to compare both files' headers, pull the extra column out, and load the remaining columns.
I am attaching the two files for your reference; could you please run them through the pipeline and filter the extra column out?
I am also attaching the pipeline for your reference.
Initial Load:

mohit_jain_0-1691997862181.png

Incremental Load: 

mohit_jain_1-1691997924461.png

Here we have to discard column E and load columns A, B, C, and D. We also have to catch column E and send an alert that the extra column E is in the file.
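As a rough illustration of this requirement (not the SnapLogic pipeline itself), a Python sketch that keeps only the initial-load columns and reports the extras might look like this, assuming both files are CSVs with a header row; the function name is made up for this example:

```python
import csv

def load_incremental(initial_path, incremental_path):
    """Keep only the columns present in the initial file;
    collect any extra columns for alerting."""
    with open(initial_path, newline="") as f:
        expected = next(csv.reader(f))        # e.g. ["A", "B", "C", "D"]
    kept_rows = []
    with open(incremental_path, newline="") as f:
        reader = csv.DictReader(f)
        extra_cols = [c for c in reader.fieldnames if c not in expected]
        for row in reader:
            # Drop values belonging to the extra columns (e.g. "E").
            kept_rows.append({c: row[c] for c in expected if c in row})
    return kept_rows, extra_cols
```

`kept_rows` would go to the load, and a non-empty `extra_cols` would trigger the alert.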

Pipeline Design :

mohit_jain_2-1691997980286.png

Thanks 

Mohit Jain


Hi @alchemiz 
Have you had a chance to look into my reply?
Thanks 
Mohit jain