Dynamic Data Validation

Ksivagurunathan
Contributor

I want to validate incoming data, which will be in CSV format. I have a target table defined for each file. Now I want to develop a common pipeline that validates the data based on the rules for each file.
Validation rules: not null, length, and datatype. I can extract this validation information from the target database dynamically, using SQL queries based on the file name. How do I merge these rules with the incoming data? My goal is to develop one pipeline, and even if the database changes (a new field is added or a datatype changes), I shouldn't have to change my pipeline; my SQL queries can pull the latest fields, datatypes, and other required information.

Basically, I'm trying to verify the data in the file before I apply it to the target table, so I can capture error records before I perform my DML operation.
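
To give a rough idea of what I mean by pulling the rules with SQL, something along these lines (just a sketch in Python with an INFORMATION_SCHEMA-style query; the exact catalog views, parameter style, and the file-name-to-table-name mapping depend on the database):

RULES_SQL = """
SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH,
       NUMERIC_PRECISION, NUMERIC_SCALE, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = :table_name
"""

def rules_from_rows(rows):
    # Shape the fetched catalog rows into one rule document per column.
    return [
        {
            "SOURCE_COLUMN_NAME": name,
            "DATATYPE": data_type,
            "MAX_LENGTH": max_len,
            "PRECISION": precision,
            "NUMERIC_SCALE": scale,
            "NULLABLE": nullable,
        }
        for (name, data_type, max_len, precision, scale, nullable) in rows
    ]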

9 REPLIES

tstack
Former Employee

The DataValidator snap is useful if the constraints are statically known. But, since you're trying to do this dynamically, we'll need to find another way.

Can you give an example of what these rules look like (preferably in the JSON format that would be returned by a SQL Select snap)?

Thanks for your quick response. Here is an example:
{SOURCE_COLUMN_NAME: COLUMN_NAME1: [{DATATYPE: NUMBER, MAX_LENGTH: null, PRECISION: 25, NUMERIC_SCALE: 7, NULLABLE: NO}]}
{SOURCE_COLUMN_NAME: COLUMN_NAME2: [{DATATYPE: VARCHAR, MAX_LENGTH: 50, PRECISION: null, NUMERIC_SCALE: null, NULLABLE: NO}]}

We could change this to a different format if required. The challenge I have is joining these rules with the actual data.
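
Per field, the checks I need are roughly like this (a Python sketch that assumes one flattened rule dict per column in the shape above; the real list of datatypes would be longer):

def check_value(value, rule):
    # Return the validation errors for one field value against one rule document.
    errors = []
    if value is None or value == "":
        if rule["NULLABLE"] == "NO":
            errors.append("null not allowed")
        return errors  # nothing else to check for an empty value
    if rule["DATATYPE"] == "NUMBER":
        try:
            float(value)
        except ValueError:
            errors.append("not numeric")
    elif rule["DATATYPE"] == "VARCHAR":
        if rule["MAX_LENGTH"] is not None and len(value) > rule["MAX_LENGTH"]:
            errors.append("exceeds max length of " + str(rule["MAX_LENGTH"]))
    return errors

# e.g. check_value("", {"DATATYPE": "NUMBER", "MAX_LENGTH": None,
#                       "PRECISION": 25, "NUMERIC_SCALE": 7, "NULLABLE": "NO"})
# returns ["null not allowed"]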

The simplest thing would be to collect all the schema documents together with a GroupByN snap and then use a Join snap to add the schema to all the documents.
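
In plain Python terms (not SnapLogic expression language), the idea is roughly this, assuming each rule document is flattened to one dict per column: collapse the rules into a single lookup (the GroupByN step) and attach that lookup to every data document (the Join step):

def group_rules(rule_docs):
    # GroupByN idea: collapse the streamed rule documents into a single lookup by column name.
    return {doc["SOURCE_COLUMN_NAME"]: doc for doc in rule_docs}

def join_schema(rows, grouped_rules):
    # Join idea: attach the grouped schema to every incoming data document.
    for row in rows:
        yield {"data": row, "schema": grouped_rules}

Downstream, each field in the data can then be checked against its entry in the schema, and any failing documents routed to an error output before the load.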

Can I ask why you want to do the validation yourself instead of letting the database do the validation? If the tables in the DB are already set up with the appropriate constraints, I would think that would be the most straightforward approach.

We want to validate the data sooner, rather than during the batch processing that loads data to the table. We could create a triggered task and use it to validate the data for errors immediately, and that way we could reuse some of the validation code rather than developing it for each and every file. Data validation rules are going to be similar for any delimited file.