11-09-2021 04:30 PM
We are using the Match Snap to do fuzzy comparisons and match Customer records that have no unique primary key. This Snap is in the ML Data Preparation Snap Pack. We traced a bottleneck to Match Snap instances that use the Levenshtein comparison method, which makes sense because each Levenshtein comparison takes time proportional to the product of the two string lengths.
Has anyone found techniques for making Levenshtein comparisons faster, such as record-set binning, pre-sorting, or other methods that could beat the out-of-the-box usage? I could potentially make direct calls to Postgres and use its native Levenshtein distance functions, but I doubt that would be faster, and it would probably be slower.
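To make the binning idea concrete, here is a minimal Python sketch of what I have in mind, run outside the Match Snap. It is only an illustration: the function names, the first-letter blocking key, and the distance threshold of 2 are all assumptions of mine, not anything the Snap exposes. It bins one record set by a cheap key, compares only records that share a bin, and abandons a comparison as soon as the distance cannot stay under the threshold.

```python
from collections import defaultdict


def bounded_levenshtein(a, b, max_dist):
    """Return the edit distance between a and b, or None if it exceeds max_dist."""
    if abs(len(a) - len(b)) > max_dist:
        return None  # the length gap alone already exceeds the threshold
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        if min(cur) > max_dist:
            return None  # row minimum never decreases, so stop early
        prev = cur
    return prev[-1] if prev[-1] <= max_dist else None


def block_key(name):
    # Hypothetical blocking key: the first character, lower-cased.
    return name.strip().lower()[:1]


def match_candidates(left, right, max_dist=2):
    """Compare only records that fall in the same bin; return (left, right, distance)."""
    bins = defaultdict(list)
    for r in right:
        bins[block_key(r)].append(r)
    matches = []
    for l in left:
        for r in bins.get(block_key(l), []):
            d = bounded_levenshtein(l.lower(), r.lower(), max_dist)
            if d is not None:
                matches.append((l, r, d))
    return matches
```

The obvious trade-off is that a first-letter key misses pairs whose first characters differ, so the key would need to be chosen to suit the data.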
Thanks Much,
Brent Van Allen
Aduro, LLC.
Keywords: Match Performance, Levenshtein, Fuzzy Matching, Data Prep, Machine Learning