Group by fields with specific group size

Hello All,

Is there a way to use Group By Fields with a limit on group size? The Group By Fields snap doesn’t let us specify a group size, so when we group on a field, whether a value occurs 5 times or 500 times, the snap always outputs a single group containing all the objects that share that value.

What if we want to split the objects that share a field value into several groups of a fixed size? For instance, if a field called ‘id’ has the value ‘5’ in 20 input objects, can we create 4 groups of 5 objects each? If there were 22 such objects, it should create 5 groups, with the last group holding the remaining 2.

This is like using the Group By N and Group By Fields snaps together.

Any suggestions or help on this would be really great.

Thank you.

@Tanmay_Sarkar

You can try the following:
After grouping the records by field, pass each group to a child pipeline that splits the group back into individual records and then uses the Group By N snap to regroup them with the required group size.
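
To make the intended result concrete, here is a minimal Python sketch of the same two-stage logic outside SnapLogic. It is only an illustration, not how the snaps are implemented: the function name `group_by_field_then_n` and the output key `group` are hypothetical, and the input is sorted first so that equal values are adjacent.

```python
from itertools import groupby
from operator import itemgetter


def group_by_field_then_n(docs, field, size):
    """Group docs by `field`, then split each group into chunks of at most `size`.

    Hypothetical helper for illustration; the output key "group" is an assumption.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    # groupby only merges consecutive items, so sort by the field first.
    # sorted() is stable, so the original order within each group is kept.
    ordered = sorted(docs, key=itemgetter(field))
    chunks = []
    for key, members in groupby(ordered, key=itemgetter(field)):
        members = list(members)
        # Second stage: the "Group By N" step applied to one group.
        for start in range(0, len(members), size):
            chunks.append({field: key, "group": members[start:start + size]})
    return chunks


docs = [{"id": 5, "seq": i} for i in range(22)]
chunks = group_by_field_then_n(docs, "id", 5)
print([len(c["group"]) for c in chunks])  # [5, 5, 5, 5, 2]
```

With 20 matching records the same call yields 4 chunks of 5; with 22 it yields 5 chunks, the last one holding the 2 leftover records.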

Regards,
Spiro Taleski

Hello @Spiro_Taleski, thank you for suggesting this.

This seems to be the quickest and the simplest solution to get this done.

Thanks,
Tanmay

If you’re asking because of a concern over memory use when group sizes become too large, then I’d like to draw your attention to a new feature of the Group By Fields snap in our November (4.27) release. The snap has a new setting called Memory Sensitivity, a menu with two options:

  • None: Same as the current behavior, where all consecutive documents with the same value for the grouping field(s) are placed together in a single output document. One output document per group.

  • Dynamic: The snap may choose to split each group into multiple “parts”. So one output document per part rather than per group. In this mode, every output document contains a new “partInfo” object that contains metadata about that part, even for groups that only have a single part:

"partInfo" : {
      "groupSize" : 21,
      "numParts" : 5,
      "partIndex" : 1,
      "partSize" : 5
    }

When Dynamic mode is selected, the snap tracks statistics about available memory and the sizes of all groups processed so far during the pipeline’s execution to determine whether to split each new group. There’s also a new setting called Min. Part Size to specify a lower limit on part size when a group is split into multiple parts.
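
As a rough illustration of the output shape only, the Python sketch below splits one group into a given number of parts and attaches a “partInfo” object to each. Everything about the splitting rule here is an assumption: the snap decides whether and how to split from its memory statistics, which this sketch does not model. The function name `split_group`, the `group` key, the near-even split, and the 1-based `partIndex` are all hypothetical.

```python
def split_group(members, num_parts):
    """Split one group into `num_parts` near-equal parts, each tagged with partInfo.

    Hypothetical helper: the near-even split and the 1-based partIndex are
    assumptions, not the snap's documented behavior.
    """
    if not 1 <= num_parts <= len(members):
        raise ValueError("num_parts must be between 1 and the group size")
    base, extra = divmod(len(members), num_parts)
    parts, start = [], 0
    for index in range(1, num_parts + 1):
        # The first `extra` parts take one additional member each.
        size = base + (1 if index <= extra else 0)
        parts.append({
            "group": members[start:start + size],
            "partInfo": {
                "groupSize": len(members),
                "numParts": num_parts,
                "partIndex": index,
                "partSize": size,
            },
        })
        start += size
    return parts


parts = split_group(list(range(21)), 5)
print(parts[0]["partInfo"])
# {'groupSize': 21, 'numParts': 5, 'partIndex': 1, 'partSize': 5}
```

Under these assumptions, a group of 21 split into 5 parts gives part sizes 5, 4, 4, 4, 4, so the first part happens to carry the same values as the sample above.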

I realize this isn’t quite what you are asking for here, but I thought you should be aware of this feature.

Thanks a lot, @ptaylor. We have yet to explore and test the new Group By Fields feature from the November (4.27) release, but your lucid explanation is going to help a great deal. I am going to share this with my team.