There are three methods for this feature:
Threshold Match compares the string values using fuzzy logic and calculates a match score for each dataset row. The user also defines a match threshold value.
If the match score is greater than or equal to the match threshold value, the two strings are considered mapped; otherwise, they are not mapped.
For example, consider two customer-name records, Stephan and Stefan, that appear repeatedly across multiple rows of customer data. The two names are pronounced similarly but spelled differently. You can compare the two records using fuzzy logic and use the lookup functionality to map them to each other.
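The threshold comparison described above can be sketched in Python. The text does not say which similarity metric the product uses, so this sketch assumes `difflib.SequenceMatcher` from the standard library as a stand-in; the names `match_score` and `is_mapped` are hypothetical.

```python
from difflib import SequenceMatcher


def match_score(a: str, b: str) -> float:
    """Return a similarity in [0, 1]. SequenceMatcher is an assumed metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_mapped(a: str, b: str, threshold: float) -> bool:
    """Two strings are mapped when the score reaches the threshold."""
    return match_score(a, b) >= threshold


score = match_score("Stephan", "Stefan")
print(round(score, 4), is_mapped("Stephan", "Stefan", threshold=0.7))
```

Under this assumed metric, the pair scores about 0.77, so the two names are mapped at a threshold of 0.7 but not at 0.8.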
Metaphone generates a phonetic key for both the Lookup Feature (selected from the Lookup connection) and the Input Feature (selected from the predecessor node of the lookup) based on the pronunciation of the word or sentence. It then produces a match score to determine whether the two strings match.
Double Metaphone generates two phonetic keys for each of the Lookup Feature and the Input Feature and then produces a match score.
Consider the datasets IRISDatasetModified and IRIS_Dataset with ID, Species, Sepal Length, Sepal Width, Petal Length, and Petal Width columns. They are used for selecting the Input Feature and Lookup Feature, respectively. The input data is shown below.
The Lookup feature dataset is shown below.
We apply the following condition to implement the Threshold Match method of fuzzy lookup.
Property | Value |
---|---|
Lookup Connection | IRIS_Dataset |
Lookup Features to Return | Id |
If non-matching records found? | Capture Non-Matching Records |
A snippet of the Lookup output data is displayed in the figure below.
Ids | Mapped to Lookup_Id |
---|---|
1, 2, and 3 | 1 |
4, 5, 6, and 7 | 51 |
8, 9, 10, 11, and 12 | 101 |
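The mapping in the table above, where each input row receives the Id of a lookup row and non-matching rows are captured separately, can be sketched as follows. This is a minimal illustration, not the product's implementation: the rows are hypothetical stand-ins for the two datasets, `fuzzy_lookup` is a hypothetical name, and `difflib.SequenceMatcher` is an assumed similarity metric.

```python
from difflib import SequenceMatcher


def match_score(a: str, b: str) -> float:
    """Assumed similarity metric in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_lookup(input_rows, lookup_rows, feature, threshold):
    """Attach the best-matching lookup Id to each input row.

    Rows whose best score is below the threshold are returned separately,
    mirroring the "Capture Non-Matching Records" option.
    """
    matched, non_matching = [], []
    for row in input_rows:
        best_id, best_score = None, 0.0
        for candidate in lookup_rows:
            score = match_score(row[feature], candidate[feature])
            if score > best_score:  # strict ">" keeps the first best row
                best_id, best_score = candidate["Id"], score
        if best_id is not None and best_score >= threshold:
            matched.append({**row, "Lookup_Id": best_id})
        else:
            non_matching.append(row)
    return matched, non_matching


# Hypothetical sample rows, not the actual dataset contents.
lookup_rows = [
    {"Id": 1, "Species": "Iris-setosa"},
    {"Id": 2, "Species": "Iris-setosa"},
    {"Id": 51, "Species": "Iris-versicolor"},
    {"Id": 101, "Species": "Iris-virginica"},
]
input_rows = [
    {"Id": 1, "Species": "setosa"},
    {"Id": 4, "Species": "versicolor"},
    {"Id": 8, "Species": "virginica"},
    {"Id": 13, "Species": "unknown"},
]

matched, non_matching = fuzzy_lookup(input_rows, lookup_rows, "Species", 0.7)
for row in matched:
    print(row["Id"], "->", row["Lookup_Id"])
print("non-matching:", [row["Id"] for row in non_matching])
```

Keeping the first lookup row with the highest score is a design choice in this sketch; it is consistent with every row of a species mapping to a single Lookup_Id, but the source does not state how ties are resolved.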
For the same pair of datasets, we use the Metaphone method.
For this, we apply the following condition.
The phonetic keys created for the species values are:
setosa – STS
versicolor – FRSKLR
virginica – FRJNK
All three match scores, 1, 0.8, and 0.7692, are greater than the match threshold of 0.7. Hence, all the species values are mapped.
For the same pair of datasets, we use the Double Metaphone method.
For this, we apply the following condition.
The double phonetic keys created for the species values are:
setosa – ('STS', '')
versicolor – ('FRSKLR', '')
iris-virginica – ('FRJNK', 'FRKNK')
virginica – ('FRJNK', '')
All four match scores, 1, 0.8, 0.5455, and 0.7692, are greater than the match threshold of 0.5. Hence, all the species values are mapped.
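One way to turn a pair of double phonetic keys into a single match score is to compare every non-empty key combination and keep the best result. The sketch below takes that approach using key pairs of the form listed above. The source does not specify the scoring formula, so both the best-of-all-pairs rule and the use of `difflib.SequenceMatcher` are assumptions, `double_key_score` is a hypothetical name, and the sketch is not expected to reproduce the scores quoted in the text.

```python
from difflib import SequenceMatcher
from itertools import product


def double_key_score(input_keys, lookup_keys) -> float:
    """Best similarity over all non-empty (input key, lookup key) pairs.

    Each argument is a (primary, secondary) tuple; an empty secondary
    key is skipped. The scoring rule is an assumption for illustration.
    """
    scores = [
        SequenceMatcher(None, a, b).ratio()
        for a, b in product(input_keys, lookup_keys)
        if a and b
    ]
    return max(scores, default=0.0)


def is_mapped(input_keys, lookup_keys, threshold: float) -> bool:
    return double_key_score(input_keys, lookup_keys) >= threshold


threshold = 0.5
print(is_mapped(("FRJNK", "FRKNK"), ("FRJNK", ""), threshold))
print(is_mapped(("STS", ""), ("FRJNK", ""), threshold))
```

Here the first pair shares an identical primary key and is mapped, while the second pair has no characters in common and is not.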