08-29-2017 02:20 PM
Hi all,
I was wondering if anyone has implemented any kind of fuzzy matching pipeline that uses a matching algorithm, such as Levenshtein, to find similar entries in the data field being looked at?
My use case: I have a list of account names and would like to fuzzy match them against my existing account names (from another source, such as a Redshift DB table or a file) and get back a list of results showing each match and its % weight of accuracy. That way I can use the results to decide whether to delete one of the accounts as a duplicate if the weight is >= 95%, send it for review if the match is between 75% and 94%, and ignore the rest. We sometimes get requests to clean up or identify possible duplicate accounts, and I would love to do this through SnapLogic rather than the other tool we’re using that has this capability.
The Levenshtein algorithm can be scripted in Python, but that requires certain libraries. We’re not limited to this one algorithm, nor do we necessarily prefer it, although Levenshtein is the one the team has used most, with great success. I am just not sure whether there is a way to register the necessary libraries on our Snaplex servers so they can be called from the Python Script Snap, or whether there are other ways to accomplish this task.
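For reference, here is a minimal sketch of the scoring and bucketing described above, written in plain Python 3 with no third-party libraries. The function names (levenshtein, similarity_pct, classify, match_accounts) are made up for illustration, the percentage is just one common way to normalize edit distance, and the sketch has not been tested inside a Script Snap, so treat it as a starting point only.

```python
def levenshtein(a, b):
    """Edit distance: insertions, deletions and substitutions cost 1 each."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_pct(a, b):
    """Assumed normalization: 100 * (1 - distance / longer length)."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def classify(score):
    """Thresholds from the post: >= 95 duplicate, 75 up to 95 review."""
    if score >= 95:
        return "duplicate"
    if score >= 75:
        return "review"
    return "ignore"


def match_accounts(new_names, existing_names):
    """Score every pair and keep only the ones worth acting on."""
    results = []
    for new in new_names:
        for old in existing_names:
            score = similarity_pct(new, old)
            action = classify(score)
            if action != "ignore":
                results.append((new, old, round(score, 1), action))
    return results
```

Calling match_accounts(["Acme Corp"], ["ACME Corp", "Globex"]) would pair "Acme Corp" with "ACME Corp" as a duplicate and drop "Globex". Comparing every pair is quadratic, so for large account lists some pre-filtering (for example, by first letter) would likely be needed.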
Any recommendations are welcome. Thanks ahead of time.
08-29-2017 03:36 PM
You probably don't need to use libraries for this.
There are built-in functions, such as sl.range, and lambda functions that can be passed to map on arrays, which you can combine to come up with an algorithm. For example:
sl.range($word.length).map(len => $word.toLowerCase().slice(0, len + 1))
Assign this to wordArray, then:
$wordArray.map(elem => { wordArray: elem, found: $inputArray.indexOf(elem) != -1 })
Lambda functions are very powerful, and you can put them to use in coming up with the algorithm.
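For readers more familiar with Python, the two expressions above correspond roughly to the sketch below: build every lowercase prefix of the word, then flag which prefixes appear in the input array. The function names prefixes and prefix_matches are hypothetical. Note that this is an exact prefix lookup rather than a Levenshtein score, so it is a building block and not a complete fuzzy matcher.

```python
def prefixes(word):
    """Lowercase prefixes of word, shortest first (mirrors the sl.range/map step)."""
    lowered = word.lower()
    return [lowered[:n + 1] for n in range(len(lowered))]


def prefix_matches(word, input_array):
    """For each prefix, record whether it occurs in input_array (the indexOf != -1 check)."""
    return [
        {"wordArray": prefix, "found": prefix in input_array}
        for prefix in prefixes(word)
    ]


if __name__ == "__main__":
    for row in prefix_matches("Acme", ["ac", "acme"]):
        print(row)
```

As with indexOf in the original expression, the membership test is case-sensitive, so the entries in the input array would need to be lowercased as well for the comparison to be meaningful.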