Forum Discussion

Majid
New Contributor III
5 years ago
Solved

Join using between clause

Hi All,

I am looking for direction on how to perform a join based on a BETWEEN condition. I would like to join these two sources on Source1.Key between Source2.Key Start and Source2.Key End.

Source 1 :

|Key|Name|
|---|---|
|0|ABC0|
|1|ABC1|
|2|ABC2|
|3|ABC3|
|4|ABC4|
|5|ABC5|
|6|ABC6|
|7|ABC7|
|8|ABC8|
|9|ABC9|
|10|ABC10|
|11|ABC11|
|12|ABC12|
|13|ABC13|
|14|ABC14|
|15|ABC15|
|16|ABC16|

Source 2 :

|Key Start|Key End|Grp_number|
|---|---|---|
|0|0|1|
|1|4|2|
|5|10|3|
|11|13|4|
|14|16|5|

Expected Output :

|Key|Name|Key Start|Key End|Grp_number|
|---|---|---|---|---|
|0|ABC0|0|0|1|
|1|ABC1|1|4|2|
|2|ABC2|1|4|2|
|3|ABC3|1|4|2|
|4|ABC4|1|4|2|
|5|ABC5|5|10|3|
|6|ABC6|5|10|3|
|7|ABC7|5|10|3|
|8|ABC8|5|10|3|
|9|ABC9|5|10|3|
|10|ABC10|5|10|3|
|11|ABC11|11|13|4|
|12|ABC12|11|13|4|
|13|ABC13|11|13|4|
|14|ABC14|14|16|5|
|15|ABC15|14|16|5|
|16|ABC16|14|16|5|
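
In plain SQL terms this would be `Source1 JOIN Source2 ON Source1.Key BETWEEN Source2."Key Start" AND Source2."Key End"`. To make the requirement concrete, here is the same logic as a rough Python sketch (just an illustration of the expected output, not a SnapLogic solution):

```python
# Desired join: match each Source1 row to the Source2 row whose
# [Key Start, Key End] range contains its Key (nested loop for clarity).
source1 = [{"Key": k, "Name": f"ABC{k}"} for k in range(17)]
source2 = [
    {"Key Start": 0,  "Key End": 0,  "Grp_number": 1},
    {"Key Start": 1,  "Key End": 4,  "Grp_number": 2},
    {"Key Start": 5,  "Key End": 10, "Grp_number": 3},
    {"Key Start": 11, "Key End": 13, "Grp_number": 4},
    {"Key Start": 14, "Key End": 16, "Grp_number": 5},
]

output = [
    {**s1, **s2}
    for s1 in source1
    for s2 in source2
    if s2["Key Start"] <= s1["Key"] <= s2["Key End"]
]
```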

10 Replies

  • OK - got it. SnapLogic does not support the “between” operation in Join snaps. So to accomplish this in a streaming manner, I propose the attached. Basically, it generates the missing records on the bottom path before the join. This should keep memory overhead low since the only arrays kept in memory are the arrays generated to fill the gaps.
    Join Between_2021_06_29.slp (10.1 KB)

    Of course, this does produce many more records on that bottom path, but since you are using Inner join, the memory in the Join is still low as well.
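
    To see the shape of the transformation, here is a rough Python sketch of the same idea (the pipeline itself is built from snaps, not code; `source1` and `source2` are as in the sketch in the original post):

    ```python
    # Bottom path: expand each Source2 range row into one row per key it
    # covers -- this "generates the missing records" before the join.
    expanded = [
        {**s2, "Key": key}
        for s2 in source2
        for key in range(s2["Key Start"], s2["Key End"] + 1)
    ]

    # With exact keys available, an ordinary inner join on Key is enough.
    by_key = {row["Key"]: row for row in expanded}
    joined = [
        {**s1, **by_key[s1["Key"]]}
        for s1 in source1
        if s1["Key"] in by_key
    ]
    ```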

  • Majid
    New Contributor III

Thank you @bojanvelevski for the great solution. I tried to use it in my use case; it works fine with a small set of records, but when there are a lot of records the Join after the Gate takes a very long time, with a throughput of 3.3 docs/second. In my use case I have 6 million records.

    • cjhoward18
      Employee

      @Majid
The Gate snap is designed to wait for all upstream snaps to finish executing and to collect all the input documents it receives into a single output document. I don’t think Gate is the right choice here: with 6 million records being collected into a single document, it may be your bottleneck.
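
      As a hypothetical Python analogy (not SnapLogic internals) of why that hurts at this scale:

      ```python
      def make_stream():
          # Simulated 6M-record input stream.
          return ({"Key": i} for i in range(6_000_000))

      # Gate-style: collect every document into a single output first,
      # so roughly 6M dicts are held in memory at once.
      gated = {"input0": list(make_stream())}

      # Streaming-style: handle one record at a time, constant memory.
      count = sum(1 for record in make_stream())  # stand-in for per-record work
      ```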

    • bojanvelevski
      Valued Contributor

Yes, I agree. You can try to optimize the process: separate the incoming records into smaller chunks with the Group By N snap instead of Gate, and use Pipeline Execute with execution reuse and an increased pool size. But be careful with the resources; it’s a pretty large amount of data.
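
      Roughly the same pattern in Python terms (the chunk size and pool size below are made up, purely illustrative):

      ```python
      from concurrent.futures import ThreadPoolExecutor
      from itertools import islice

      def chunks(iterable, n):
          """Yield lists of up to n records, like Group By N."""
          it = iter(iterable)
          while batch := list(islice(it, n)):
              yield batch

      def process_chunk(batch):
          """Placeholder for the work a child pipeline would do on one chunk."""
          return len(batch)

      records = ({"Key": i} for i in range(6_000_000))  # simulated input

      # A fixed pool of workers handles chunks concurrently, analogous to
      # Pipeline Execute with reuse enabled and an increased pool size.
      with ThreadPoolExecutor(max_workers=4) as pool:
          results = list(pool.map(process_chunk, chunks(records, 50_000)))
      ```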

  • What is the source of this data? Is this coming from a file or a database or what?

Can you do the join in the source before it is sent to SnapLogic? Generally, it is preferable to do that type of work (i.e. joins, filters, sorts, and aggregates) at the source when possible.

  • Majid
    New Contributor III

Hi @koryknick, I am not able to do the join at the DB level because my source files are mainframe binary-format VSAM files with COMP-3 fields, stored in AWS S3. I am using the Copybook parser to convert the files from EBCDIC to ASCII with the corresponding copybook before I can join them.
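
    For context, COMP-3 is packed decimal: each byte holds two decimal digits and the final nibble carries the sign, which is why the raw bytes are unreadable until decoded. A standalone Python sketch of that decoding, just to illustrate the format (the Copybook parser snap handles this internally):

    ```python
    def unpack_comp3(raw: bytes, scale: int = 0):
        """Decode an IBM COMP-3 (packed decimal) field.

        Each byte holds two decimal digits; the final nibble is the sign
        (0xC or 0xF = positive, 0xD = negative).
        """
        nibbles = []
        for b in raw:
            nibbles.append((b >> 4) & 0x0F)
            nibbles.append(b & 0x0F)
        sign = nibbles.pop()  # last nibble is the sign
        value = int("".join(str(d) for d in nibbles))
        if sign == 0x0D:
            value = -value
        return value / (10 ** scale) if scale else value

    # Example: bytes 0x12 0x34 0x5C decode to +12345.
    assert unpack_comp3(bytes([0x12, 0x34, 0x5C])) == 12345
    ```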

  • Majid
    New Contributor III

Thank you so much @koryknick. This works as expected, and the performance issue is also resolved with this approach. I had earlier tried to use a sequence generator to produce multiple documents by passing the start and end values as parameters, but that approach took a long time because it triggered millions of child pipelines.

    • koryknick
      Employee

      I’m happy to hear this solution worked! Good luck with the rest of your implementation.