Count Vectorizer | |||
Description | Transforms a collection of texts into a sparse matrix of token counts, in which each column corresponds to a unique word (feature) in the dictionary built from the whole collection. | ||
Why to use | To vectorize multiple texts against a shared dictionary of features. | ||
When to use |
| When not to use | On numerical data. |
Prerequisites |
| ||
Input | Any dataset that contains text data. | Output |
|
Statistical Methods used |
| Limitations | Cannot be used on data other than text data. |
The terms that are useful in understanding CountVectorizer are given below.
Text – a single text data point within a textual dataset, for example, one user review of a product X.
Textual Dataset – the collection of all the texts, for example, all user reviews of the product X.
Feature – each unique word in the textual dataset.
Consider, as an example, a textual dataset with the following user reviews as sample texts:
[“Product X is overrated”, “It is good”, “Product X needs improvement”]
Here, the CountVectorizer creates a sparse matrix in which each unique word across the three texts is mapped to a column, and each cell holds an integer count. Each column in the sparse matrix represents a feature, and each feature has an index number. Each text from the dataset is a row in the sparse matrix. The value of each cell is the number of times the feature occurs in that particular text.
This can be visualized as below.
Feature → | product | x | is | overrated | it | good | needs | improvement |
text0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
text1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
text2 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
Count → | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 |
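Thus, the rows text0, text1, and text2 form the actual sparse matrix for the example. The Count row is not part of the matrix; it shows the total count of each feature across the whole dataset.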
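The counting described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the product's actual implementation: the `tokenize` helper, the lowercasing, the whitespace splitting, and the first-appearance column order are all assumptions chosen to mirror the example table.

```python
from collections import Counter

# The three sample texts from the example.
texts = ["Product X is overrated", "It is good", "Product X needs improvement"]


def tokenize(text):
    # Assumption: lowercase the text and split on whitespace.
    # Real tokenizers usually handle punctuation and other details as well.
    return text.lower().split()


# Build the dictionary: every unique word (feature) gets a column index,
# assigned in order of first appearance (an assumption for this sketch).
vocabulary = {}
for text in texts:
    for token in tokenize(text):
        vocabulary.setdefault(token, len(vocabulary))

# Build one row per text; each cell is the count of that feature in the text.
matrix = []
for text in texts:
    counts = Counter(tokenize(text))
    matrix.append([counts[feature] for feature in vocabulary])

# Column sums give the total count of each feature across the dataset.
totals = [sum(column) for column in zip(*matrix)]

print(list(vocabulary))
for row in matrix:
    print(row)
print(totals)
```

The sketch stores the matrix as dense lists for readability; a sparse representation would keep only the non-zero cells. Library implementations may behave differently from this sketch. For instance, scikit-learn's `CountVectorizer` with default settings orders features alphabetically and ignores single-character tokens such as "x".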