Problem with copy and join

Wassim
New Contributor

Hello,

I am working on a pipeline, and the problem is with one part of it.

I copy my result set.
I keep my first output as is.
I aggregate my second output.
Finally, I want to bring those aggregate results back into the first output using a Lookup.

This makes my pipeline run endlessly. If I remove the Lookup (or Join) and write the two outputs to two different files, it takes less than 2 minutes.
I think it may be because the two outputs come from the same result set.

Could you please tell me whether you have seen this issue before and how to deal with it?

Thank you

1 ACCEPTED SOLUTION

ptaylor
Employee

Hi Wassim,

This is actually a known issue (SWAT-3096) that we're working on a fix for. It happens when there are at least 1024 records being copied by the Copy snap, for reasons that are a bit difficult to explain.

Until we have a fix, there are at least three workarounds:

  • Swap the order of the inputs to the Lookup snap, so that the output of the Aggregate is the first input rather than the second.
  • Insert a Sort snap right after each output of the Copy snap. It won't work if you put the Sort before the Copy. In this workaround, the point of the Sort snaps isn't to sort the data, which might already be sorted; it's to essentially create independent buffers of the data from each of the Copy snap's output views.
  • Replace the Lookup with a Join, and set the Sorted streams property to Unsorted.
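
One way to picture why the pipeline can stall, and why the first two workarounds help, is the toy model below. It is only a sketch under stated assumptions, not SnapLogic internals: it assumes the Copy snap writes into a bounded buffer on the branch that feeds the Lookup directly, that the Lookup does not read that branch until the Aggregate has finished, and that the Aggregate cannot finish until the Copy has. The names `run_pipeline`, `capacity`, and `direct_branch_drained_early` are hypothetical, and the small capacity used here is not meant to match the 1024-record threshold.

```python
from collections import deque


def run_pipeline(n_records, capacity, direct_branch_drained_early):
    """Toy model of Copy -> (direct branch, Aggregate branch) -> Lookup.

    Returns "completed" or "stalled". The buffer model is an assumption
    made for illustration only.
    """
    direct = deque()  # bounded buffer between Copy and the Lookup (direct branch)
    total = 0         # running state of the Aggregate branch

    for record in range(n_records):
        if direct_branch_drained_early:
            # Models workaround 1 (swapped inputs) or workaround 2 (a Sort
            # with its own buffer): something consumes the direct branch as
            # records arrive, so the bounded buffer never fills up.
            direct.clear()

        if len(direct) >= capacity:
            # Copy cannot write. The Lookup waits for the Aggregate, the
            # Aggregate waits for Copy to finish, and Copy waits for space.
            return "stalled"

        direct.append(record)
        total += record  # the Aggregate keeps consuming its own copy

    return "completed"


if __name__ == "__main__":
    print(run_pipeline(3, 4, False))
    print(run_pipeline(10, 4, False))
    print(run_pipeline(10, 4, True))
```

In this model a small input fits in the buffer and completes, a larger one stalls, and draining the direct branch early lets the larger one complete as well.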

11 REPLIES

SpiroTaleski
Valued Contributor

@Wassim

Below are some points to keep in mind about the In-Memory Lookup Snap:

  • The join operation within the Snap starts only when the right input document stream ends. In your case, this means the Snap first waits for the aggregation of the data to complete, and only then is the data processed by the In-Memory Lookup.
  • All of the right input data is loaded into memory (of the JVM) as lookup data, so the Snap can cause poor performance.
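
As a rough illustration of the two points above, a generic in-memory lookup can be sketched in Python as follows. This is not the Snap's actual implementation; the function name `in_memory_lookup` and the fields `id`, `v`, and `total` are hypothetical.

```python
def in_memory_lookup(left_stream, right_stream, key):
    """Generic hash lookup: load the right input fully, then stream the left."""
    # Phase 1: read the right input to its end and hold all of it in memory.
    table = {}
    for doc in right_stream:
        table[doc[key]] = doc

    # Phase 2: only now start reading the left input, one document at a time.
    for doc in left_stream:
        match = table.get(doc[key], {})
        # Merge the matching lookup fields (if any) into the left document.
        yield {**match, **doc}


if __name__ == "__main__":
    left = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
    right = [{"id": 1, "total": 10}]
    for merged in in_memory_lookup(left, right, "id"):
        print(merged)
```

Nothing is read from the left input until the right input has been consumed completely, and the memory used grows with the size of the right input.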

Do you have other processes running in parallel with this one that also use similar Snaps (Join, aggregation Snaps, group Snaps, etc.) and could have an impact on memory?

Did you try the same scenario using the Join Snap?

Regards,
Spiro Taleski

Wassim
New Contributor

@Spiro_Taleski
Thank you for the answer.
I did try Join as well.
I am aware of all that. I have 6000 rows to aggregate and join to my first result, and it is not even moving.

If I duplicate my snaps and run the aggregate and the join against the initial snaps, it takes less than a minute.

Regards

Wassim
New Contributor

Here is a simple example:
test copy and join_2021_09_08.slp (14.2 KB)

test - 2021-09-03T152531.018.xlsx (740.4 KB)

Wassim
New Contributor

Thank you very much, @ptaylor.