Count Vectorizer

Description	Transforms a collection of texts into a sparse matrix of token counts. Each cell holds the frequency of one unique word (feature) from the dictionary built over the whole textual dataset.
Why to use	To vectorize multiple texts against a common dictionary.
When to use	To convert each text into a vector of word counts, based on how often each dictionary word occurs in that text.
When not to use	On numerical data.
Prerequisites	The input variable should be of text type and should contain processed text.
Input	Any dataset that contains text data.
Output	A sparse matrix in which each cell holds the count of a word in a particular text. Each row of the matrix represents a text, and each column represents a feature from the dictionary.
Statistical Methods used	N-gram, Stop words
Limitations	Cannot be used on data other than text data.

The terms that are useful in understanding CountVectorizer are given below.

Text – a single text data point within a textual dataset, for example, a user review of product X.

Textual Dataset – a collection of all the texts, for example, all user reviews of product X.

Feature – each unique word in the textual dataset.

Consider a textual dataset with the following user reviews as sample texts:

[“Product X is overrated”, “It is good”, “Product X needs improvement”]

where,

Product X is overrated is text0.
It is good is text1.
Product X needs improvement is text2.

Here, the CountVectorizer creates a sparse matrix in which each text from the textual dataset is a row and each feature is a column. Each feature has an index number that identifies its column. The value of each cell is a non-negative integer: the count of that feature in that particular text.

This can be visualized as below.

Feature →	product	x	is	overrated	it	good	needs	improvement
text0	1	1	1	1	0	0	0	0
text1	0	0	1	0	1	1	0	0
text2	1	1	0	0	0	0	1	1
Count →	2	2	2	1	1	1	1	1

Thus, the rows text0 to text2 are the actual representation of the sparse matrix for the example. The Count row shows the total count of each feature across the textual dataset.
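
The mapping from texts to rows of counts can be sketched in a few lines of plain Python. This is only a minimal illustration, not the product's implementation: the `tokenize` helper, the lowercasing, and the first-appearance ordering of features are assumptions chosen so that the columns line up with the example above.

```python
import re

texts = ["Product X is overrated", "It is good", "Product X needs improvement"]

def tokenize(text):
    # Assumed, simplified tokenizer: lowercase, then split into word tokens.
    return re.findall(r"\w+", text.lower())

# Build the dictionary: one column index per unique word (feature),
# assigned in order of first appearance.
vocabulary = {}
for text in texts:
    for token in tokenize(text):
        vocabulary.setdefault(token, len(vocabulary))

# One row per text, one column per feature; each cell is a word count.
matrix = [[0] * len(vocabulary) for _ in texts]
for row, text in enumerate(texts):
    for token in tokenize(text):
        matrix[row][vocabulary[token]] += 1

# Total count of each feature across the whole textual dataset.
totals = [sum(column) for column in zip(*matrix)]

# Sparse view: store only the nonzero cells as (row, column) -> count.
sparse = {
    (r, c): count
    for r, cells in enumerate(matrix)
    for c, count in enumerate(cells)
    if count
}

print(list(vocabulary))
for name, cells in zip(["text0", "text1", "text2"], matrix):
    print(name, cells)
print("totals", totals)
```

Most cells are zero even in this small example, which is why the result is stored as a sparse matrix: only the nonzero counts in `sparse` need to be kept.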
