How do I avoid multiple select requests

Question

I have multiple data-stores and need to hydrate data pulled from one with data from another.
The issue is that if I have two independent staring snaps perform the queries, I would have to pull every item in the table (hundreds of thousands) just to join one column with the simple ID field from the first data-set.
It doesn’t make sense to me to pull every item from a table when the first query provides the list of IDs that needs to be pulled.  If I put a query snap in the middle of the pipeline then it creates a select statement for each data-point which is very bad as it results in 10K+ individual select statements that are just looking for IDs.
I cannot find anything anywhere on how to group queries to do this.  I have tried using the aggregate snap and the group by snaps and cannot find a way to do this.
Is it the case that SnapLogic is built in a way that inhibits me from doing this?
It seems like a lot of overhead to do something like write to a file then read the file just to get an array of IDs to perform a query.
I see query logs in Redshift that seem to point to queries sometimes being ‘chunked’ for these inline select snaps but for the most part they don’t, and I am unsure if this is just spotty visibility using the query viewer in Redshift.
Any help is appreciated, Thanks.

tstack · Answer

shane:

If I put a query snap in the middle of the pipeline then it creates a select statement for each data-point which is very bad as it results in 10K+ individual select statements that are just looking for IDs.

Ideally, you’d be able to write a WHERE clause in the Select snap that referenced the ID array in the input document (e.g. the_id in $the_ids).  But, I don’t think that’s well supported by JDBC drivers.  For now, you’ll probably need to use an expression to expand the array into something like the_id in ($the_ids[0], $the_ids[1], ..., $the_ids[N]).  Since that’s a bit complicated, I’ve created an expression library that has a function to do just that:

github.com

SnapLogic/contrib-elib/blob/master/sqlhelper.expr

{
    /**
     * Split an array into fixed-sized chunks.
     *
     * @param  {Array}   arr  The array to split into chunks
     * @param  {Integer} size The maximum size of each chunk
     * @return {Array}        An array of arrays
     */
    _chunkArray: (arr, size) =&gt;
        sl.range(arr.length / size).map(x =&gt; arr.slice(x * size, (x * size) + size)),

/**
     * Generate a SQL fragment for testing if a column value is in an array.
     *
     * @param  {String}  column  The name of the column to test.
     * @param  {Object}  doc     The input document
     * @param  {String}  path    The JSON-Path to the array containing the values.
     * @param  {Integer} maxSize The maximum number of elements that can be in a SQL array.
     * @return {String}          The SQL fragment.  (Example: name in ($names[0], $names[1]))
     */

This file has been truncated. show original

You can include that library in your pipeline and then use the following snippet to call the function in the “Where clause” property of the Select snap:
lib.sqlhelper.inArray('the_id', $, '$the_ids')

Here’s a screenshot of a Select snap with the property configured and an example of the generated SQL fragment:

I’ll also file a feature request to see if the snaps can be updated to more naturally support this use case.

tstack · Answer

shane:

If I put a query snap in the middle of the pipeline then it creates a select statement for each data-point which is very bad as it results in 10K+ individual select statements that are just looking for IDs.

Ah, I forgot there’s a Lookup snap that will batch multiple input documents into a single SELECT query with a WHERE clause that is dynamically generated.  That might work better for you.

shane · Answer

Thanks for the feedback!
I am receiving an error using your expression, it seems to not like a BigInteger type:

I think this may work with a little adjustment though.  I tried converting each id into a string value first, but it seems that this error is thrown on the arr.length which isn’t referencing the value but rather the array itself.
This is the output of the snap feeding my query snap: (if it helps)

I have not been able to find a way to turn this into an array to pass in.
As far as the Redshift Lookup snap, I will have to take the time to configure the connections as it looks like I will have to create a SnapLogic IAM role and everything.  There is also the added cost of the Redshift snap pack just to do a simple select statement that I have to make sure my higher-ups are ok with.
Also, I have need to do this type of query into MSSQL and MySQL databases as well so the Redshift snaps wouldn’t help much there.  It would be nice to have a way to expose or pass in the document array itself to another snap, or to the expr file for more control.
Thanks again for all of the help.

Forum Discussion

How do I avoid multiple select requests

3 Replies

Recent Discussions

Way to lock down in Prod org to "Monitor" only access?

trace API and proxy calls

Pagination Logic Fails After Migrating from REST GET to HTTP Client Snap

Pipeline Execute Pool size

Concat values of a field based on value of another field