Count Vectorizer

Description

Transforms a collection of texts into a sparse matrix of token counts. The unique words (features) across the whole collection form the dictionary, and each cell of the matrix records how often a given word occurs in a given text.

Why to use

To vectorize multiple texts against a common dictionary.

When to use

  • To convert each text into a vector of word counts, with one entry for each word in the dictionary.

When not to use

On numerical data.

Prerequisites

  • The input variable should be of text type.
  • The input variable should be preprocessed (cleaned) text.

Input

Any dataset that contains text data.

Output

  • A sparse matrix in which each cell holds the count of a word in a particular text.
  • Each column of the matrix represents a feature from the dictionary.

Statistical Methods used

  • N-gram
  • Stop words
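The two methods listed above can be illustrated with a minimal sketch in plain Python. It is not the product's implementation: the `tokenize` rule, the tiny `STOP_WORDS` set, and the helper names are assumptions made for illustration only. Stop words are common words removed before counting, and an n-gram is a run of n consecutive tokens treated as a single feature.

```python
import re

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"is", "it"}

def tokenize(text):
    # Assumed rule: lowercase, then split on runs of word characters.
    return re.findall(r"\w+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Drop every token that appears in the stop-word set.
    return [t for t in tokens if t not in stop_words]

def ngrams(tokens, n):
    # Join each run of n consecutive tokens into one feature string.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Product X is overrated")
filtered = remove_stop_words(tokens)
unigrams = ngrams(filtered, 1)
bigrams = ngrams(filtered, 2)

print(unigrams)
print(bigrams)
```

With n = 1 each remaining word is its own feature; with n = 2 each pair of adjacent words becomes a feature, so the dictionary grows but captures some word order.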

Limitations

Cannot be used on data other than text data.


The terms that are useful in understanding CountVectorizer are given below.

Text – it is a single text data point within a textual dataset, for example, a user review on a product X.

Textual Dataset – it is a collection of all the texts, for example, a collection of all user reviews for the product X.

Feature – each unique word in the textual dataset.

Consider the textual dataset with the following user reviews as sample texts:

[“Product X is overrated”, “It is good”, “Product X needs improvement”]

where,

  1. “Product X is overrated” is text0.
  2. “It is good” is text1.
  3. “Product X needs improvement” is text2.

Here, the CountVectorizer creates a sparse matrix in which each of the three texts is mapped to a feature vector of integer counts. Each column in the sparse matrix represents a feature, and each feature has an index number. Each text from the dataset is a row in the sparse matrix. The value of each cell is the count of that feature in that particular text.

This can be visualized as below.

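| Feature → | Product | x | is | overrated | it | good | needs | improvement |
|-----------|---------|---|----|-----------|----|------|-------|-------------|
| text0     | 1       | 1 | 1  | 1         | 0  | 0    | 0     | 0           |
| text1     | 0       | 0 | 1  | 0         | 1  | 1    | 0     | 0           |
| text2     | 1       | 1 | 0  | 0         | 0  | 0    | 1     | 1           |
| Count →   | 2       | 2 | 2  | 1         | 1  | 1    | 1     | 1           |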

Thus, the 3 × 8 block of counts in rows text0 to text2 is the actual representation of the sparse matrix for the example. The Count row only totals each feature across the whole dataset.
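The example can be reproduced with a short sketch in plain Python. It is only an illustration of the counting idea, not the product's implementation: the `tokenize` rule (lowercasing and splitting on word characters), the first-seen column ordering, and the dictionary-based sparse view are all assumptions.

```python
import re
from collections import Counter

texts = ["Product X is overrated", "It is good", "Product X needs improvement"]

def tokenize(text):
    # Assumed rule: lowercase, then split on runs of word characters.
    return re.findall(r"\w+", text.lower())

# Build the dictionary: one column index per unique word, in first-seen order.
vocabulary = {}
for text in texts:
    for token in tokenize(text):
        vocabulary.setdefault(token, len(vocabulary))

# Build one row of counts per text (dense form, for readability).
matrix = []
for text in texts:
    counts = Counter(tokenize(text))
    matrix.append([counts[word] for word in vocabulary])

# Sparse view: keep only the non-zero cells as {(row, column): count}.
sparse = {
    (r, c): value
    for r, row in enumerate(matrix)
    for c, value in enumerate(row)
    if value
}

# Per-feature totals across the whole dataset.
totals = [sum(column) for column in zip(*matrix)]

print(list(vocabulary))
for row in matrix:
    print(row)
print(totals)
print(len(sparse), "non-zero cells out of", len(matrix) * len(vocabulary))
```

The three printed rows correspond to text0, text1, and text2. Storing only the non-zero cells is what makes the matrix sparse: most words do not occur in most texts, so most cells are zero and need not be stored.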
