Fuzzy Matching (Operator Toolbox)

Synopsis

This operator merges two data sets in a fuzzy way, based on two nominal attributes. It matches examples whose values are similar but not necessarily equal.

Description

The operator takes one attribute from the left example set and one from the right example set to match rows. To perform a multi-attribute match, see the Cross Distance operator.

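The operator calculates a similarity between the two chosen attributes. For each example on the left side, it merges in the k most similar examples from the right side, where k is set by the 'number of matches' parameter. If attribute names collide, the suffix _from_ES2 is appended to the right hand side's attribute, as done by the Join operator.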

The similarity method is set with the 'similarity measure' parameter. Currently, all similarity measures are based on the Levenshtein distance, which defines the distance between two strings as the number of single-character changes (insertions, deletions, or substitutions) needed to get from one string to the other. The measures are taken from the fuzzywuzzy library. For a detailed explanation of the different options, please see: https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
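
To make the distance definition above concrete, here is a minimal Python sketch of the plain Levenshtein distance. It is an illustration only and not the operator's implementation. The `similarity` helper and its 0 to 100 scaling are assumptions made for this example; the fuzzywuzzy measures may normalize differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization of the distance to a 0..100 score
    (assumed for illustration, not taken from the operator)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)
```

For example, `levenshtein("kitten", "sitting")` needs two substitutions and one insertion, so the distance is 3.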

Input

left (Data table)
The first example set used for matching.
right (Data table)
The second example set used for matching.

Output

matched (Data table)
The merged example set. It contains the union of the attributes from the left and the right side. If attribute names collide, ''_from_ES2'' is appended to the right hand side's attribute. For each row of the left side, the resulting table contains up to ''number_of_matches'' rows holding its closest matches.
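
The shape of the matched output can be sketched in Python as follows. This is a simplified model under stated assumptions, not the operator's code: rows are plain dictionaries, the function `fuzzy_match` and the `distance` column are hypothetical names, and raw Levenshtein distance stands in for the configurable similarity measure.

```python
from typing import Dict, List

Row = Dict[str, object]


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (stand-in for the similarity measure)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def fuzzy_match(left: List[Row], right: List[Row],
                left_attr: str, right_attr: str,
                number_of_matches: int,
                suffix: str = "_from_ES2") -> List[Row]:
    """For each left row, emit up to number_of_matches merged rows
    built from the closest right rows."""
    matched: List[Row] = []
    for l_row in left:
        # Score every right row against this left row, closest first.
        scored = sorted(
            ((edit_distance(str(l_row[left_attr]), str(r_row[right_attr])), idx)
             for idx, r_row in enumerate(right)),
        )
        for dist, idx in scored[:number_of_matches]:
            merged = dict(l_row)
            for name, value in right[idx].items():
                # Colliding attribute names get the suffix on the right side.
                merged[name + suffix if name in l_row else name] = value
            merged["distance"] = dist  # hypothetical extra column
            matched.append(merged)
    return matched


# Usage: one left row, several candidate spellings on the right.
left = [{"name": "RapidMiner"}]
right = [{"name": "RapidMinr"}, {"name": "Excel"}, {"name": "RapidMiner"}]
result = fuzzy_match(left, right, "name", "name", number_of_matches=2)
```

Here both example sets use the attribute `name`, so the right-hand value ends up in `name_from_ES2`, and `result` holds two rows for the single left row.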

Parameters

left side attribute The attribute of the left hand side example set which should be used for merging.
right side attribute The attribute of the right hand side example set which should be used for merging.
number of matches Defines the maximum number of matches you want to find for each left hand side example.
similarity measure Similarity measure which should be used to determine a match.

Tutorial Processes

Find Words Similar to RapidMiner

In this tutorial we use 'Fuzzy Matching' to find the spellings most similar to RapidMiner. To do this, we first create two example sets: one with a single row containing "RapidMiner", and one with multiple rows of possible alternative spellings. The Fuzzy Matching operator is then used to determine the 5 most similar spellings and their similarity.

Find Similar People on the Titanic

In this tutorial we try to find names that are similar to either "Andersson, Miss. Ingeborg Constanzia" or "Asplund, Master. Filip Oscar", who were on the Titanic. To do so, we first select these two people from the Titanic data set. Then we fuzzy-match them against the remaining passengers. The result is a data set with 5 similar names for each of our two selected people.
