11-19-2021 12:00 AM
Before onboarding SnapLogic, we were using Informatica IDQ tool for our Data Quality (DQ) Automation and Executing DQ validation rules on our business data sets.
Implementing new DQ rule or modifying existing rule was comparatively huge effort due to below reasons :
Informatica IDQ skillset was limited to specific team members.
Creating a DQ rule in Informatica IDQ based on a SQL is time consuming, as we cannot reuse same SQL that we use to analyze our data sets & each DQ rule in Informatica IDQ needs to be developed from scratch. A kind of doing reverse engineering on SQLs submitted by Data Analysts & Data Engineers.
Maintaining & Managing Informatica IDQ tools incurs a separate cost.
Onboarding a new data source system in Informatica requires additional efforts for configurations.
Because of drawbacks of Informatica IDQ that we faced, we wanted to explore on how we can leverage capabilities of SnapLogic to implement Data Quality framework that is SQL friendly i.e. Everyone who knows SQL can implement DQ validation checks easily, thus removing skill dependency on Informatica IDQ and saving huge amount on Cost of Maintenance for Informatica IDQ.
We implemented an Automated Data Quality Validation Framework using SnapLogic & Snowflake (Cloud Database) .
Benefits:
Development time significantly reduced.
we were able to deploy new DQ rules or enhance existing DQ rule for our ever growing data sets in comparatively 50 times faster compared to Informatica IDQ.
No dependency on additional skill set (Informatica IDQ tool).
Everyone who can write SQL statements can create, modify & deploy DQ rules easily.
This gives flexibility to Data Analysts & Data Engineers to quickly deploy DQ rules for their data sets and thus removing any dependency on separate person or team who specializes in Informatica IDQ skillset.
Onboarding new data sources is comparatively easy in SnapLogic.
i.e., Structured or Unstructured data can be easily consumed using various prebuilt Snaps.
Flexibility to leverage best features of SnapLogic & Snowflake both,We got additional capabilities to trigger ‘Alerts’ for DQ failure or DQ event occurrence using Email Snap pack in SnapLogic, and there is some much explore to customize our framework using various Snap packs that are available.
Our Automation Framework is built on 2 main components :
A. SnapLogic :
Complete DQ framework is created using various capabilities & features provided by SnapLogic tool.
This framework is created as a template that is scalable & can modified as per future requests with minimal efforts.
Below tasks were successfully accomplished using SnapLogic.
Reading Data from multiple sources like Hana, Snowflake, Excel Files etc. using various prebuilt data reading Snap packs.
SQLs created by Data Analysts & Data Engineers act as parameter for our framework, this parameter values will change for each DQ scenario & our framework will generate results based on incoming parameters values.
We make use of Task Scheduler feature of SnapLogic to schedule our Master Job, that triggers all DQ checks that are defined in our config table maintained in Snowflake.
Email Snap Pack is used to trigger alerts to users in case of
i. DQ rule failure
ii. If a desired event occurs in data set
iii. Generate summarized DQ execution report over email.
iv. REST & Twilio snap packs can be used to generate text message
Alerts.
Few examples of the Snap packs that are used in this framework are
• Flow – Pipeline Execute, Router, Join, Sort, Union
• JDBC –Execute, Select
• Transform – Mapper
• Email – Email Sender
B. Snowflake :
It’s a database to store our results & provides analytical capabilities.
This help to maintain any config data for our framework in SnapLogic
Store results based on DQ rule execution.
Provides further analytical capabilities by making our automated Data Quality execution results available over Visualization Dashboard (like Tableau, Power BI) that is easily accessible to all our users.
Users can get summarized as well detailed view of Data Quality Results over dashboards & If required drill down to check invalid records for a particular rule.
We are also able to generate Historical Trend Analysis for Data Quality Rules that are deployed in our prod environment.
Implementing Data Quality framework on Snaplogic provided us great benefits & opened up so many possibilities to explore using various snap packs.
Productivity Improvement :
we can use same SQLs for implementing DQ checks that are used for data analysis, thus saving a lot of time by avoiding setup time in Informatic IDQ.
Speed Improvement :
we are no longer dependent on Informatica IDQ skillset, as we can deploy DQ rules almost 10 times faster using our SQL knowledge.
Reduced Manual Inefficiencies :
DQ framework is a templatized model i.e., one time effort is required to setup pipelines & framework in Snaplogic, after that it is reused to execute multiple DQ rules (SQL queries).
So unlike informatica IDQ where each rule requires manual mapping creation & configuration, we don’t spend manual efforts in doing repetitive tasks using out new framework.
Cost effective :
We don’t have spent extra money on maintaining Informatica IDQ servers & IDQ specific resources to implement our DQ checks.
we can achieve all Data Quality related requirements using SnapLogic & snowflake.
Easy Maintenance :
Modifying any existing DQ rules is as easy as updating a SQL statement in a config table.
6.Scalable & Easy Customizable :
Having DQ framework on SnapLogic & Snowflake, gives us so much flexibility to customize (using various Snap packs) & scale as per different business requirements.
We can leverage features of both SnapLogic & Snowflake into this framework.
This solution was designed & implemented by myself (Ruchik Thakkar) with support & guidance from my manager Jim Ferris. Both of us are FTE for Illumina.
#Anything else you would like to add?
Onboarding SnapLogic was a real game changer.
It not only allowed in usual Data Onboarding Process but we were able to implement it successfully for developing a Data Quality Framework.
SnapLogic provided us a way where we can templatized this solution,
so it’s easy to reuse for other teams, easy to customize as per Business Requirements & Scalable as needed.