TF-IDF

Description

TF-IDF stands for Term Frequency-Inverse Document Frequency.
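TF-IDF transforms a collection of texts into a matrix of TF-IDF features. The TF-IDF score of a feature measures how important that feature is to a text: it increases with the frequency of the feature in the text and decreases with the number of texts in the dictionary that contain the feature.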

Why to use

For TF-IDF vectorization of multiple texts in a dictionary.

When to use

  • To extract keywords
  • To retrieve information that represents the importance of each feature in a dictionary.

When not to use

On numerical data.

Prerequisites

  • The input variable should be of text type.
  • The input variable should contain preprocessed text.

Input

Any dataset that contains text data.

Output

  • A document-term matrix that displays the TF-IDF score of each feature in the dictionary.
  • Each row of the matrix represents a text, and each column represents a feature from the dictionary.

Statistical Methods used

  • N-gram
  • Stop words
  • Term Frequency (TF)
  • Inverse Document Frequency (IDF)
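
The first two items above are preprocessing steps rather than scores. As a minimal sketch of what they do, assuming a simple alphabetic tokenizer and a deliberately tiny stop-word list (the names STOP_WORDS, tokenize, remove_stop_words, and ngrams are hypothetical and not part of any product API):

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "on", "of", "and"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = remove_stop_words(tokenize("The cat sat on the mat"))
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(unigrams)  # ['cat', 'sat', 'mat']
print(bigrams)   # ['cat sat', 'sat mat']
```

Each unigram or bigram produced this way can serve as one feature when the TF and IDF values are computed.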

Limitations

Cannot be used on data other than text data.


The following terms are useful in understanding TF-IDF.
Term frequency (TF) – The number of times a word (term) appears in a text, divided by the total number of terms in that text.
Document frequency (DF) – The number of texts in the dictionary that contain the word.
Inverse document frequency (IDF) – The logarithm of the number of texts in the dictionary divided by the number of texts that contain the word. If the word is very common and appears in most texts, the IDF approaches zero; the rarer the word, the larger the IDF.
The TF-IDF score is computed as TF multiplied by IDF. The higher the TF-IDF score of a feature, the more relevant that word is to the text in which it appears.
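
The computation can be sketched in plain Python. This is a minimal illustration, not a definitive implementation. It assumes whitespace tokenization and the unsmoothed forms TF = count / text length and IDF = ln(N / DF), where N is the number of texts. The function name tf_idf_matrix and the sample texts are hypothetical.

```python
import math
from collections import Counter


def tf_idf_matrix(texts):
    """Return (features, matrix); matrix[i][j] is the TF-IDF of features[j] in texts[i]."""
    tokenized = [t.lower().split() for t in texts]
    # One column per unique word, in alphabetical order.
    features = sorted({w for doc in tokenized for w in doc})
    n_texts = len(tokenized)
    # Document frequency: the number of texts that contain each word.
    df = Counter(w for doc in tokenized for w in set(doc))
    # Unsmoothed IDF (assumption): natural log of N / DF.
    idf = {w: math.log(n_texts / df[w]) for w in features}
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        # TF is the word count divided by the text length; empty texts score 0.
        row = [(counts[w] / len(doc)) * idf[w] if doc else 0.0 for w in features]
        matrix.append(row)
    return features, matrix


texts = ["the cat sat", "the dog sat", "the bird flew"]
features, matrix = tf_idf_matrix(texts)
for text, row in zip(texts, matrix):
    print(text, [round(score, 3) for score in row])
```

In this sample, "the" appears in all three texts, so its IDF is ln(3/3) = 0 and its score is zero in every row, while "cat" appears in only one text and receives the highest score in the first row. Library implementations often add smoothing or normalization, so their scores may differ from this sketch.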