Forum Discussion

Wassim
New Contributor
4 years ago
Solved

Problem with copy and join

Hello,

I am working on this pipeline

The problem is with this part

I copy my result set.
I keep the first output as it is,
aggregate the second output,
and finally want to bring the aggregated results back in using a Lookup.

This makes my pipeline run endlessly. If I remove the Lookup (or Join) and write to two different files instead, it takes less than 2 minutes.
I think it may be because the two outputs are copies of the same result set.

Could you please tell me whether you have seen this issue before and how to handle it?

Thank you

  • Hi Wassim,

    This is actually a known issue (SWAT-3096) that we’re working on a fix for. It happens when there are at least 1024 records being copied by the Copy snap, for reasons that are a bit difficult to explain.

    Until we have a fix, there are at least three workarounds:

    • Swap the order of the inputs to the Lookup snap, so that the output of the Aggregate is the first input rather than the second.
    • Insert a Sort snap right after each output of the Copy snap. It won’t work if you put the Sort before the Copy. In this workaround, the point of the Sort snaps isn’t to sort the data, which might already be sorted – it’s to essentially create independent buffers of the data from each of the Copy snap’s output views.
    • Replace the Lookup with a Join, and set the Sorted streams property to Unsorted.
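
For readers who want a feel for how a hang like this can arise, here is a toy model in Python. It is only a sketch and not SnapLogic's implementation: it assumes that each Copy output view has a bounded buffer (the default `capacity` of 1024 is borrowed from the threshold mentioned above, and treating that number as a buffer size is an assumption) and that the Lookup does not read its first input until its second input has ended. All names (`run_pipeline`, `swap_inputs`, `END`) are made up for the illustration.

```python
import queue
import threading

END = object()  # end-of-stream marker (part of this toy model)


def run_pipeline(n_records, capacity=1024, swap_inputs=False, timeout=1.0):
    """Return True if the modelled pipeline completes, False if it stalls.

    capacity=0 means unbounded buffers (queue.Queue treats maxsize<=0 as infinite).
    """
    raw = queue.Queue(maxsize=capacity)     # Copy view 0 -> Lookup
    to_agg = queue.Queue(maxsize=capacity)  # Copy view 1 -> Aggregate
    agg_done = threading.Event()

    def copy():
        # Write every record (then END) to both views; a full buffer blocks.
        try:
            for item in [*range(n_records), END]:
                raw.put(item, timeout=timeout)
                to_agg.put(item, timeout=timeout)
        except queue.Full:
            pass  # a view was never drained: treat as a stall and give up

    def aggregate():
        # Consume the whole stream; the result exists only after END arrives.
        try:
            while to_agg.get(timeout=timeout) is not END:
                pass
        except queue.Empty:
            return  # upstream stopped sending without END
        agg_done.set()

    def drain_raw():
        # Read the un-aggregated copy until END; False if the stream dries up.
        try:
            while raw.get(timeout=timeout) is not END:
                pass
        except queue.Empty:
            return False
        return True

    workers = [threading.Thread(target=copy), threading.Thread(target=aggregate)]
    for w in workers:
        w.start()
    if swap_inputs:
        # Workaround 1: the raw copy is read first, so Copy is never blocked.
        ok = drain_raw() and agg_done.wait(timeout)
    else:
        # Original layout: wait for the Aggregate before touching the raw copy.
        ok = agg_done.wait(timeout) and drain_raw()
    for w in workers:
        w.join()
    return ok
```

In this model, `run_pipeline(6000)` returns `False` because the raw view fills up, Copy blocks, and the Aggregate never sees the end of its input. `run_pipeline(6000, swap_inputs=True)` and `run_pipeline(6000, capacity=0)` both return `True`, which loosely mirrors the first two workarounds: reading the raw copy first, or giving each view its own unbounded buffer as the Sort snaps effectively do.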

11 Replies

    • Wassim
      New Contributor

      Thank you very much, @ptaylor.

    • vgautam64
      New Contributor III

      I too faced this exact problem recently and this thread proved really helpful. Thanks a lot!

      Considering this post is more than 1.5 years old now, what is the status on the fix for this issue?

  • viktor_n
    Contributor II

    Hi @Wassim,

    First, try sorting the data before you send it to the Join.

    How many records do you have before copying them?
    If there is only one, try a Gate snap instead of the Join snap; you will get the same result as with the Join. If there are more records and you are joining on some condition, you cannot use Gate.

    Regards.

  • Wassim
    New Contributor

    Hi Viktor,

    Thank you for the answer.
    I have 6k rows, so I can't use the Gate.
    I sorted on the same field I group by, and it doesn't help.

    • skatpally
      Former Employee

      Did you mean the Join snap when you referred to the Lookup snap here? Also, what is the output of the Aggregate snap and of the Mapper after it?

  • Wassim
    New Contributor

    Hi skatpally, thank you very much for the answer. It's a Lookup snap.

    Here is the aggregate

    and here is the mapper after the aggregate

    thank you

  • SpiroTaleski
    Valued Contributor

    @Wassim

    Below are some points to keep in mind about the In-Memory Lookup Snap:

    • The join operation within the Snap starts only when the right input document stream ends. In your case, this means the Snap first waits for the aggregation to complete, and only then is the data processed by the In-Memory Lookup.
    • All of the right input data is loaded into the JVM's memory as lookup data, so the Snap can cause poor performance.
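
The two points above can be pictured with a small Python sketch. This is only an illustration of the general "load one side, then stream the other" pattern, not the Snap's actual code; the function name `in_memory_lookup` and the sample rows are hypothetical.

```python
def in_memory_lookup(left_rows, right_rows, key):
    """Yield each left row merged with the matching right row, if any."""
    # Phase 1: read the right input to the end and keep it all in memory.
    table = {row[key]: row for row in right_rows}
    # Phase 2: only now stream the left input and probe the table.
    for row in left_rows:
        yield {**row, **table.get(row[key], {})}


# Hypothetical sample data: detail rows on the left, aggregates on the right.
orders = [
    {"dept": "a", "amount": 10},
    {"dept": "b", "amount": 5},
    {"dept": "a", "amount": 7},
]
totals = [{"dept": "a", "total": 17}, {"dept": "b", "total": 5}]
joined = list(in_memory_lookup(orders, totals, "dept"))
```

Nothing from the left side is consumed until the right side has been read completely, and the memory used grows with the size of the right input.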

    Do you have other processes running in parallel with this one that use similar Snaps (Join, aggregation Snaps, grouping Snaps, etc.) and could affect memory?

    Have you tried the same scenario using the Join Snap?

    Regards,
    Spiro Taleski

  • Wassim
    New Contributor

    @Spiro_Taleski
    Thank you for the answer.
    I did try a Join as well.
    I am aware of all that. I have 6000 rows to aggregate and join back to my first result, and it's not even moving.
    It's like this:

    If I duplicate my snaps and do the aggregate and the join against the initial snaps, it takes less than a minute.

    Regards