05-01-2018 02:28 PM
Crown has a pipeline that gets stuck on the Join snap and never finishes. The pipeline (CrownInsiteDev/User Projects/Kim Miesse/ load_fact_daily_hour_meter) will complete and return the expected data if we configure the Join snap with a join type of Left outer and the data as unsorted. In that scenario the pipeline completes in about a minute.
I was able to get the pipeline to complete in less than 30 seconds by adding Sort snaps before each of the Join snap's input views. The customer realized that would work, but the Join snap in question was set to unsorted, so they think it should work without the sorts.
My explanation was that this behavior is due to the Merge algorithm. The Merge algorithm is the most efficient way to join two very large sets of data that are both sorted on the join key. The Merge Join simultaneously reads a row from each input and compares them using the join key. If there’s a match, the joined row is returned. Otherwise, the row with the smaller value can be discarded because, since both inputs are sorted, the discarded row cannot match any remaining row in the other set of data. This repeats until one of the inputs is exhausted. Even if there are still rows left in the other input, they cannot match any rows in the fully scanned input, so there is no need to continue. Since each input is scanned at most once, the maximum cost of a Merge Join is the sum of the sizes of both inputs, or in terms of complexity, O(N+M). Generally speaking, if we sort the data prior to the Merge Join, we are more efficient because of the way the Merge algorithm works.
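To make the description above concrete, here is a minimal Python sketch of a generic merge join over two lists that are already sorted on the join key. It is only an illustration of the textbook algorithm, not the Join snap's actual implementation; the function name merge_join and the sample "id" field are made up for the example.

```python
from operator import itemgetter


def merge_join(left, right, key):
    """Inner merge join of two lists of dicts.

    Assumes BOTH inputs are sorted ascending by `key`. Illustrative only;
    this is not how any particular product implements its join.
    """
    results = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1  # left row is smaller: it cannot match anything later on the right
        elif lk > rk:
            j += 1  # right row is smaller: it cannot match anything later on the left
        else:
            # Find the run of right rows that share this key value.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and key(left[i]) == lk:
                for r in right[j:j_end]:
                    results.append({**left[i], **r})
                i += 1
            j = j_end
    return results


by_id = itemgetter("id")
right_rows = [{"id": 2, "b": "p"}, {"id": 3, "b": "q"}, {"id": 4, "b": "r"}]

# Sorted left input: both matching ids (2 and 4) are found.
sorted_left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}, {"id": 4, "a": "z"}]
matched = merge_join(sorted_left, right_rows, by_id)

# Same rows, unsorted: the scan skips past id 2 on the right and never goes back.
unsorted_left = [{"id": 4, "a": "z"}, {"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
missed = merge_join(unsorted_left, right_rows, by_id)
```

With the sorted input the sketch returns the rows for ids 2 and 4; with the unsorted input it returns only the row for id 4, because the algorithm never revisits rows it has already discarded. That is the property that makes a merge-style join depend on sorted inputs.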
The customer explained that they are using the Join as a way to wait for all rows in a stream before continuing processing, i.e. they want to wait for all rows to be inserted into a table before querying that table. They do not want to concatenate two data sets together as a Union snap would. They also want to wait for both input streams to finish before outputting, which the Union snap does not do.
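As a rough illustration of the "wait for everything, then continue" pattern the customer is after, here is a small Python sketch. It is a generic model of a gate that drains all of its inputs before anything downstream runs, not a description of how the Join or Union snaps behave internally; insert_rows, gate, and the log list are hypothetical names used only for this example.

```python
def insert_rows(rows, log):
    """Hypothetical upstream step: 'inserts' each row and passes it along."""
    for row in rows:
        log.append(f"insert {row}")
        yield row


def gate(*streams):
    """Drain every input stream completely before returning anything.

    Downstream work that runs after gate() returns is guaranteed to start
    only once all inputs are exhausted.
    """
    return [list(stream) for stream in streams]


log = []
# Two inputs: the stream of inserted rows and a single trigger document.
inserted, trigger = gate(insert_rows([1, 2, 3], log), iter(["run query"]))
log.append("query table")  # runs only after every insert has been logged
```

The trade-off in this model is that the gate holds every document in memory until all inputs finish, which is the price of the ordering guarantee.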
In the documentation for the Join snap (https://docs-snaplogic.atlassian.net/wiki/spaces/SD/pages/1439005/Join) it says “If you select Merge, the documents from the input views are merged into one document. You do not have to specify any other join properties when merging documents.” They read this as saying that it doesn’t matter what the join criteria are on a merge and that it doesn’t look for matches, so it is even less important for the data to be sorted.
So they do not feel they should have to sort the data before the joins. Does anyone have any additional technical recommendations or suggestions on why using a Sort snap will improve the efficiency and, in cases like this, avoid the “hung” condition on the Join snap?
Thanks!
Rob