Move data from AWS S3 to Snowflake Data Warehouse
Created by @pkoppishetty The pipeline pattern moves data from AWS S3 to Snowflake Data Warehouse. The pattern contains a Directory Browser (to retrieve the file list from S3), Snowflake Execute, and Snowflake Bulkload Snaps (to load the corresponding files into Snowflake). Configuration File Reader Preview Snowflake Bulkload Output Table Data from Snowflake Select Snap Sources: AWS S3 Targets: Snowflake Snaps used: Snowflake Execute, Directory Broswer, File Reader, Snowflake Bulk Load, Snowflake Select Downloads Integration pattern for moving data from AWS S3 to Snowflake datawarehouse.slp (14.2 KB)4.6KViews0likes1CommentIngest data from SQL Server (RDBMS) to AWS Cloud Storage (S3)
Contributed by @SriramGopal from Agilisium Consulting The pipeline is designed to fetch records on an incremental basis from any RDBMS system and load to cloud storage (Amazon S3 in this case) with partitioning logic. This use case is applicable to Cloud Data Lake initiatives. This pipeline also includes, the Date based Data Partitioning at the Storage layer and Data Validation trail between source and target. Parent Pipeline Control Table check : Gets the last run details from Control table. ETL Process : Fetches the incremental source data based on Control table and loads the data to S3 Control Table write : Updates the latest run data to Control table for tracking S3 Writer Child Pipeline Audit Update Child Pipeline Control Table - Tracking The Control table is designed in such a way that it holds the source load type (RDBMS, FTP, API etc.) and the corresponding object name. Each object load will have the load start/end times and the records/ documents processed for every load. The source record fetch count and target table load count is calculated for every run. Based on the status (S-success or F-failure) of the load, automated notifications can be triggered to the technical team. Control Table Attributes: UID – Primary key SOURCE_TYPE – Type of Source RDBMS, API, Social Media, FTP etc TABLE_NAME – Table name or object name. START_DATE – Load start time ENDDATE – Load end time SRC_REC_COUNT – Source record count RGT_REC_COUNT – Target record count STATUS – ‘S’ Success and ‘F’ Failed based on the source/ target load Partitioned Load For every load, the data gets partitioned automatically based on the transaction timestamp in the storage layer (S3) Configuration Sources : RDBMS Database, SQL Server Table Targets : AWS Storage Snaps used : Parent Pipeline : Sort, File Writer, Mapper, Router, Copy, JSON Formatter, Redshift Insert, Redshift Select, Redshift - Multi Execute, S3 File Writer, S3 File Reader, Aggregate, Pipeline Execute S3 Writer Child Pipeline : Mapper, JSON Formatter, S3 File Writer Audit Update Child Pipeline : File Reader, JSON Parser, Mapper, Router, Aggregate, Redshift - Multi Execute Downloads IM_RDBMS_S3_Inc_load.slp (43.6 KB) IM_RDBMS_S3_Inc_load_S3writer.slp (12.2 KB) IM_RDBMS_S3_Inc_load_Audit_update.slp (18.3 KB)7.2KViews1like3CommentsIngest data from NoSQL Database (MongoDB) into AWS Cloud Storage (S3)
Contributed by @SriramGopal from Agilisium Consulting The pipeline is designed to fetch records on an incremental basis from document-oriented NoSQL database system (Mongo in this case) and load to cloud storage (Amazon S3) with partitioning logic. This use case is applicable to Cloud Data Lake initiatives. This pipeline also includes, the Date based Data Partitioning at the Storage layer and Data Validation trail between source and target. Parent Pipeline S3 Writer Child Pipeline Audit Update Child Pipeline Control Table - Tracking The Control table is designed in such a way that it holds the source load type (RDBMS, FTP, API etc.) and the corresponding object name. Each object load will have the load start/end times and the records/ documents processed for every load. The source record fetch count and target table load count is calculated for every run. Based on the status (S-success or F-failure) of the load, automated notifications can be triggered to the technical team. Control Table Attributes: UID – Primary key SOURCE_TYPE – Type of Source RDBMS, API, Social Media, FTP etc TABLE_NAME – Table name or object name. START_DATE – Load start time ENDDATE – Load end time SRC_REC_COUNT – Source record count RGT_REC_COUNT – Target record count STATUS – ‘S’ Success and ‘F’ Failed based on the source/ target load Partitioned Load For every load, the data gets partitioned automatically based on the transaction timestamp in the storage layer (S3) Configuration Sources : NoSQL Database, MongoDB Table Targets : AWS Storage Snaps used : Parent Pipeline: MongoDB - Find, Sort, File Writer, Mapper, Router, Copy, JSON Formatter, Redshift Insert, Redshift Select, Redshift - Multi Execute, S3 File Writer, S3 File Reader, Aggregate, Pipeline Execute S3 Writer Child Pipeline: Mapper, JSON Formatter, S3 File Writer Audit Update Child Pipeline: File Reader, JSON Parser, Mapper, Router, Aggregate, Redshift - Multi Execute Downloads IM_NoSQL_S3_Inc_load.slp (29.9 KB) IM_Nosql_S3_Inc_load_S3writer.slp (4.8 KB) IM_Nosql_S3_Inc_load_Audit_update.slp (12.0 KB)5.2KViews0likes1CommentIngest data from File Server (FTP/sFTP) into AWS Cloud Storage (S3)
Contributed by @SriramGopal from Agilisium Consulting The pipeline is designed to transfer files from a FTP/SFTP server and load them to cloud storage (Amazon S3 in this case). This use case is applicable to Cloud Data Lake initiatives. This pipeline also includes, the Date based Data Partitioning at the Storage layer and Data Validation trail between source and target. Control Table - Tracking The Control table is designed in such a way that it holds the source load type (RDBMS, FTP, API etc.) and the corresponding object name. Each object load will have the load start/end times and the records/ documents processed for every load. The source record fetch count and target table load count is calculated for every run. Based on the status (S-success or F-failure) of the load, automated notifications can be triggered to the technical team. Control Table Attributes: UID – Primary key SOURCE_TYPE – Type of Source RDBMS, API, Social Media, FTP etc TABLE_NAME – Table name or object name. START_DATE – Load start time ENDDATE – Load end time SRC_REC_COUNT – Source record count RGT_REC_COUNT – Target record count STATUS – ‘S’ Success and ‘F’ Failed based on the source/ target load Partitioned Load For every load, the data gets partitioned automatically based on the transaction timestamp in the storage layer (S3) Configuration Sources: FTP/sFTP File Extracts Targets: AWS Storage Snaps used: File Reader, File Writer, Mapper, Router, JSON Formatter, Redshift Insert, Redshift Select, Redshift - Multi Execute, S3 File Writer Downloads IM_FTP_to_S3_load.slp (15.3 KB)5KViews0likes1CommentIngest data from Salesforce application into AWS Cloud Storage (S3)
Contributed by @SriramGopal from Agilisium Consulting The pipeline is designed to fetch records from CRM application (Salesforce in this case) via API integration and load to cloud storage (Amazon S3 in this case) with partitioning logic. This use case is applicable to Cloud Data Lake initiatives. This pipeline also includes, the Date based Data Partitioning at the Storage layer and Data Validation trail between source and target. Control Table - Tracking The Control table is designed in such a way that it holds the source load type (RDBMS, FTP, API etc.) and the corresponding object name. Each object load will have the load start/end times and the records/ documents processed for every load. The source record fetch count and target table load count is calculated for every run. Based on the status (S-success or F-failure) of the load, automated notifications can be triggered to the technical team. Control Table Attributes: UID – Primary key SOURCE_TYPE – Type of Source RDBMS, API, Social Media, FTP etc TABLE_NAME – Table name or object name. START_DATE – Load start time ENDDATE – Load end time SRC_REC_COUNT – Source record count RGT_REC_COUNT – Target record count STATUS – ‘S’ Success and ‘F’ Failed based on the source/ target load Partitioned Load For every load, the data gets partitioned automatically based on the transaction timestamp in the storage layer (S3) Configuration Sources: Salesforce Account Targets: AWS Storage Snaps used: Salesforce Read, File Reader, File Writer, Mapper, Router, Copy, JSON Formatter, Redshift Insert, Redshift Select, Redshift - Multi Execute, S3 File Writer, S3 File Reader, Aggregate, JSON Parser Downloads IM_API_Salesforce_to_S3_load.slp (31.6 KB)3.8KViews0likes1CommentMove data from Database to Kafka and then to Amazon S3 Storage
Contributed by @pkona There are two pipelines in this pattern. The first pipeline extracts data from a Database and publishes to a Kafka Topic. The second pipeline consumes from the Kafka Topic and ingests into the Amazon S3 Storage by calling a third pipeline. Publish to Kafka Source: Oracle Target: Kafka Snaps used: Oracle Select, Group By N, Confluent Kafka Producer Consume from Kafka to S3 Source: Kafka Target: Pipeline Execute Snaps used: Confluent Kafka Consumer, Mapper, Pipeline Execute (calling Write file to S3 pipeline) Write file to S3 Source: Kafka Target: Amazon S3 Storage Snaps used: Mapper, JSON Formatter, File Writer Downloads Publish to Kafka.slp (5.4 KB) Consume from Kafka to S3.slp (5.4 KB) Write file to S3.slp (4.5 KB)3.4KViews0likes0Comments