We have a use case where we need to read a large amount of data from an on-premises RDBMS (JDBC) and write it as multiple files in S3. No matter what, the data has to travel across the network from our on-premises data center to AWS. I’m looking for suggestions on the most efficient pipeline design.
Options we’ve considered:
- A single pipeline that runs on an on-premises Snaplex, reads the data from the RDBMS, and writes it to S3 (roughly the flow sketched after this list).
- A single pipeline that runs on an AWS Snaplex, reads the data from the RDBMS, and writes it to S3.
- A parent/child pipeline where the parent runs on an on-premises Snaplex and reads the data from the RDBMS, then uses a Pipeline Execute Snap to run a child pipeline on the AWS Snaplex that writes the data to S3. The data passes over the network to an unconnected input view in the child pipeline.
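For context, this is a rough standalone sketch (plain Python, not SnapLogic) of the data flow the single-pipeline options boil down to: read the source table in chunks and write each chunk to S3 as its own object. The DSN, query, bucket name, key prefix, and chunk size are all placeholders, and the driver choice (pyodbc) is just an assumption for illustration:

```python
# Illustrative sketch of the chunked read -> multi-file S3 write flow.
# Assumes a DB-API driver (pyodbc here) and boto3; all names are placeholders.
import csv
import io

import boto3
import pyodbc

CONN_STR = "DSN=onprem_rdbms"        # hypothetical DSN for the on-prem database
QUERY = "SELECT * FROM big_table"    # hypothetical source query
BUCKET = "my-target-bucket"          # hypothetical S3 bucket
CHUNK_ROWS = 500_000                 # rows per output file

s3 = boto3.client("s3")

with pyodbc.connect(CONN_STR) as conn:
    cursor = conn.cursor()
    cursor.execute(QUERY)
    columns = [col[0] for col in cursor.description]

    part = 0
    while True:
        rows = cursor.fetchmany(CHUNK_ROWS)
        if not rows:
            break
        # Serialize the chunk to CSV in memory and upload it as one S3 object.
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(columns)
        writer.writerows(rows)
        s3.put_object(
            Bucket=BUCKET,
            Key=f"exports/big_table/part-{part:05d}.csv",
            Body=buf.getvalue().encode("utf-8"),
        )
        part += 1
```

Whichever Snaplex runs the work, the same volume of data crosses the on-premises-to-AWS link once; the design question is mainly where the read, the chunking/formatting, and the S3 write each execute.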