Lookup for Categorical Variables

Fuzzy lookup is based on fuzzy logic: it matches string values that are similar but not necessarily identical. It is supported only for categorical variables.

Methods

There are three methods for this feature:

Threshold Matching:

It compares string values using fuzzy logic and calculates a match score for each dataset row. You define a match threshold value.
If the match score is greater than or equal to the match threshold value, the two strings are said to be mapped; otherwise, they are not mapped.
For example, consider two spellings of a customer's name, Stephan and Stefan, that appear repeatedly in multiple rows of customer data. The two names are pronounced similarly but spelled differently. You can use the lookup functionality to compare the two records using fuzzy logic and map them to each other.
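
The comparison described above can be sketched in Python as follows. This is a minimal illustration, not the product's implementation: it assumes the match score is a similarity ratio between 0 and 1 and computes one with the standard library's difflib.SequenceMatcher. The names match_score and is_mapped are hypothetical.

```python
from difflib import SequenceMatcher


def match_score(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1]; 1 means the strings are identical."""
    # Assumption: case is ignored. The product's scoring may differ.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_mapped(a: str, b: str, threshold: float) -> bool:
    """Report whether two strings are mapped for the given match threshold."""
    if not 0 <= threshold <= 1:
        raise ValueError("match threshold must be between 0 and 1")
    # Uses >= as in the description above; the behaviour exactly at the
    # threshold is an assumption.
    return match_score(a, b) >= threshold


score = match_score("Stephan", "Stefan")
print(round(score, 4))
print(is_mapped("Stephan", "Stefan", 0.7))
print(is_mapped("Stephan", "Stefan", 0.9))
```

Under these assumptions, the two spellings are mapped at a threshold of 0.7 but not at 0.9, which shows how the threshold controls how strict the lookup is.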

Metaphone:

It generates a phonetic key for both the Lookup Feature (selected from the Lookup connection) and Input Feature (selected from the predecessor node to lookup) based on the pronunciation of the word/sentence. It then produces a match score to see whether the two strings match.

Double Metaphone:

It generates two phonetic keys for the Lookup Feature and Input Feature and then produces a match score.

Notes:

  • The match threshold value should be between 0 and 1 and is defined by the user.

  • The rows are matched only if the match score is greater than the match threshold value.

Example of Fuzzy Logic using Threshold Matching

Consider the datasets IRISDatasetModified and IRIS_Dataset with ID, Species, Sepal Length, Sepal Width, Petal Length, and Petal Width columns. They are used for selecting the Input Feature and Lookup Feature, respectively. The input data is shown below.


The Lookup feature dataset is shown below.

We apply the following condition to implement the Threshold Match method of fuzzy lookup.


The following Lookup properties are selected.

Property                          Value
Lookup Connection                 IRIS_Dataset
Lookup Features to Return         Id
If non-matching records found?    Capture Non-Matching Records

A snippet of the Lookup output data is displayed in the figure below.

 
Observations:
  • A Lookup_Flag value of 1 in every row indicates that, for a Match Threshold of 0.7, all three species values from the input data are mapped. (For an unmapped value, the Lookup_Flag value is 0.)
  • The Id_Lookup column shows the row IDs in the IRIS_Dataset whose species values are mapped to those from the IRISDatasetModified dataset. Each Id is mapped to the first Lookup_Id whose match score is greater than the threshold.

Ids                       Mapped to Lookup_Id
1, 2, and 3               1
4, 5, 6, and 7            51
8, 9, 10, 11, and 12      101

For example, the row with Lookup_Id 51 (Lookup_Species) has a match score greater than the threshold for the versicolor species. Hence, all the versicolor and iris-versicolor values from the input data are mapped to it.
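
The first-match rule above can be sketched as follows. This is an illustration under stated assumptions, not the product's code: the lookup rows, their species spellings, the difflib-based score, and the names first_lookup_id and lookup_flag are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical lookup rows (Id, Species) in dataset order. Only the first
# row of each species is listed; the spellings are assumptions.
LOOKUP_ROWS = [(1, "setosa"), (51, "versicolor"), (101, "virginica")]


def match_score(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] (assumed scoring function)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def first_lookup_id(value: str, threshold: float) -> Optional[int]:
    """Return the Id of the first lookup row scoring above the threshold."""
    for lookup_id, species in LOOKUP_ROWS:
        if match_score(value, species) > threshold:
            return lookup_id
    return None  # non-matching record


def lookup_flag(value: str, threshold: float) -> int:
    """Return 1 for a mapped value and 0 for an unmapped one."""
    return 0 if first_lookup_id(value, threshold) is None else 1


for value in ["setosa", "iris-versicolor", "iris-virginica", "tulip"]:
    print(value, first_lookup_id(value, 0.7), lookup_flag(value, 0.7))
```

The scan stops at the first row above the threshold, which is why every versicolor-like input value ends up with the same Lookup_Id. The value tulip is an invented non-matching record that shows the Lookup_Flag of 0.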
The result page displays the ratio value (match score) for the three species.


As you can see, all three values (1, 0.8, and 0.7826) are greater than the match threshold of 0.7. Hence, all the species values are mapped.
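
These three values are consistent with a ratio of the form 2*M/T, where M is the number of matching characters and T is the combined length of both strings. The sketch below is an assumption about how such a score could be derived, not a statement of the product's algorithm. It also assumes that the compared pairs are the ones listed in the code, and ratio_terms is a hypothetical helper.

```python
from difflib import SequenceMatcher
from typing import Tuple

# Assumed pairs of compared species values.
PAIRS = [
    ("setosa", "setosa"),
    ("iris-versicolor", "versicolor"),
    ("iris-virginica", "virginica"),
]


def ratio_terms(a: str, b: str) -> Tuple[int, int]:
    """Return (M, T): matched character count and combined string length."""
    blocks = SequenceMatcher(None, a, b).get_matching_blocks()
    matched = sum(block.size for block in blocks)
    return matched, len(a) + len(b)


scores = []
for a, b in PAIRS:
    m, t = ratio_terms(a, b)
    scores.append(round(2 * m / t, 4))
    print(f"{a!r} vs {b!r}: 2*{m}/{t} = {scores[-1]}")
```

For instance, iris-virginica and virginica share 9 characters out of 23 in total, which gives 18/23, the 0.7826 reported on the result page.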
Example of Fuzzy Logic using Metaphone

For the same pair of datasets, we use the Metaphone method.

For this, we apply the following condition.



A snippet of the Lookup output data is displayed in the figure below.

The result page displays the Lookup Metaphone phonetic keys and their ratio value (match score) for the three species.

As you can see,

  • The phonetic keys created are:

    • setosa – STS

    • versicolor – FRSKLR

    • virginica – FRJNK

  • All three values (1, 0.8, and 0.7692) are greater than the match threshold of 0.7. Hence, all the species values are mapped.
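
The scores above can be approximated by comparing phonetic keys instead of raw strings. The sketch below does not compute Metaphone keys itself. The keys STS, FRSKLR, and FRJNK come from the list above, while ARSFRSKLR and ARSFRJNK are assumed keys for the prefixed values iris-versicolor and iris-virginica. Scoring the keys with difflib.SequenceMatcher is also an assumption, and key_score is a hypothetical name.

```python
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.7

# species -> (key on one side, key on the other side).
# The keys starting with "ARS" are assumptions, not taken from the product.
KEY_PAIRS = {
    "setosa": ("STS", "STS"),
    "versicolor": ("ARSFRSKLR", "FRSKLR"),
    "virginica": ("ARSFRJNK", "FRJNK"),
}


def key_score(key_a: str, key_b: str) -> float:
    """Similarity ratio in [0, 1] between two phonetic keys."""
    return SequenceMatcher(None, key_a, key_b).ratio()


key_scores = {
    species: round(key_score(key_a, key_b), 4)
    for species, (key_a, key_b) in KEY_PAIRS.items()
}

for species, value in key_scores.items():
    print(species, value, value > MATCH_THRESHOLD)
```

Because the score is computed on the keys, two values that are spelled differently but produce the same key would score 1, which is the main difference from Threshold Matching on raw strings.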

Example of Fuzzy Logic using Double Metaphone

For the same pair of datasets, we use the Double Metaphone method.
For this, we apply the following condition.



A snippet of the Lookup output data is displayed in the figure below.



The result page displays the Lookup Double Metaphone phonetic keys and their ratio values (match scores) for the species.

As you can see,

  • The double phonetic keys created are:

    • setosa – ('STS', '')

    • versicolor – ('FRSKLR', '')

    • iris-virginica – ('FRJNK', 'FRKNK')

    • virginica – ('FRJNK', '')

  • All four values (1, 0.8, 0.5455, and 0.7692) are greater than the match threshold of 0.5. Hence, all the species values are mapped.

