TF-IDF |
Description | TF-IDF stands for Term Frequency-Inverse Document Frequency. It transforms a collection of texts into a matrix of TF-IDF features. The TF-IDF score of a feature increases with how often the feature occurs in a text and decreases with how many texts in the collection contain it, so the score reflects how important the feature is to a text relative to the whole dictionary. |
Why to use | To vectorize multiple texts into TF-IDF scores over a common dictionary. |
When to use | - To extract keywords.
- To measure the importance of each feature in the dictionary.
| When not to use | On numerical data. |
Prerequisites | - The input variable should be of text type.
- The input variable should be preprocessed text.
|
Input | Any dataset that contains text data. | Output | - A document-term matrix that displays the TF-IDF score of each feature in the dictionary.
- Each row of the matrix represents a text, and each column represents a feature from the dictionary.
|
Statistical Methods used | - N-gram
- Stop words
- Term Frequency (TF)
- Inverse Document Frequency (IDF)
| Limitations | Cannot be used on data other than text data. |
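
The Term Frequency and Inverse Document Frequency steps listed above can be sketched in plain Python. This is a minimal illustration, not the implementation this tool uses: the function name `tfidf_matrix` is hypothetical, tokenization is a simple whitespace split, only single-word features (unigrams) are built, no stop words are removed, and IDF is taken as the unsmoothed log(N / df). Libraries commonly apply smoothed or normalized variants, so their scores will differ.

```python
import math
from collections import Counter


def tfidf_matrix(texts):
    """Return (dictionary, matrix) for a list of texts.

    Hypothetical helper for illustration only: unigrams, whitespace
    tokenization, no stop-word removal, unsmoothed IDF.
    """
    docs = [text.lower().split() for text in texts]

    # The dictionary: every distinct feature across all texts,
    # sorted so that the column order is stable.
    dictionary = sorted({term for doc in docs for term in doc})

    # Document frequency: the number of texts that contain each feature.
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))

    # Inverse document frequency: log(N / df).
    idf = {term: math.log(n_docs / df[term]) for term in dictionary}

    matrix = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Term frequency (count / text length) times IDF; empty texts get zeros.
        row = [(counts[t] / total) * idf[t] if total else 0.0 for t in dictionary]
        matrix.append(row)
    return dictionary, matrix


texts = ["the cat sat", "the dog sat", "the cat ran"]
dictionary, matrix = tfidf_matrix(texts)
```

Each row of `matrix` corresponds to one text and each column to one entry of `dictionary`. In this sketch a feature that appears in every text, such as "the", receives a score of zero because its IDF is log(1).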