We are using the Match Snap to do fuzzy comparisons and match Customer records that lack a unique primary key. This Snap is in the ML Data Preparation Snap Pack. We traced a bottleneck to uses of the Match Snap with the Levenshtein comparison method, which makes sense because this algorithm is expensive.
Has anyone found techniques for making Levenshtein comparisons faster than the out-of-the-box usage, such as record-set binning, pre-sorting, or other methods? I could potentially call Postgres directly and use its native Levenshtein distance functions, but I doubt that would be better, and it would probably be worse.
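To make the question concrete, here is a rough Python sketch of the kind of binning I have in mind, run outside the Match Snap. It is only an illustration under assumptions: the function names, the first-letter blocking key, and the distance threshold of 2 are all made up for this example and are not anything the Snap exposes. The idea is to compare records only within a bin, skip pairs whose lengths differ by more than the threshold, and abandon a comparison as soon as the distance can no longer fall under the threshold.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein_within(a: str, b: str, max_dist: int) -> bool:
    """Return True if the edit distance between a and b is <= max_dist."""
    # Cheap length filter: the distance is at least the length difference.
    if abs(len(a) - len(b)) > max_dist:
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        # Early exit: the row minimum never decreases, so once every
        # cell exceeds the threshold the final distance must as well.
        if min(curr) > max_dist:
            return False
        prev = curr
    return prev[-1] <= max_dist


def block_key(name: str) -> str:
    # Hypothetical blocking key (an assumption): first character of the
    # normalized name. Pairs that differ in this character are never compared.
    return name.strip().lower()[:1]


def candidate_matches(names, max_dist=2):
    """Compare names only within the same block instead of all pairs."""
    blocks = defaultdict(list)
    for name in names:
        blocks[block_key(name)].append(name.strip().lower())
    matches = []
    for group in blocks.values():
        for a, b in combinations(group, 2):
            if levenshtein_within(a, b, max_dist):
                matches.append((a, b))
    return matches
```

The obvious trade-off is recall: with this key, a typo in the first character puts two records in different bins, so they are never compared. A coarser key, or running a second pass with a different key, would reduce that risk at the cost of more comparisons.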
Brent Van Allen
Keywords: Match Performance, Levenshtein, Fuzzy Matching, Data Prep, Machine Learning