CountVectorizer is located under Textual Analysis, in Text Vectorization, in the task pane on the left. Drag and drop the algorithm onto the canvas (or double-click the node) to use it. Click the algorithm to view and select the properties for analysis.
Refer to Properties of CountVectorizer.
The available properties of CountVectorizer are as shown in the figure given below.
The table given below describes the different fields present on the Properties pane of CountVectorizer.
Field | | Description | Remark |
---|---|---|---|
Run | | It allows you to run the node. | - |
Explore | | It allows you to explore the successfully executed node. | - |
Vertical Ellipses | | The available options are | - |
Task Name | | It displays the name of the selected task. | You can click the text field to edit or modify the name of the task as required. |
Text | | It allows you to select the text variable for which you need to perform the task. | |
Advanced | Lowercase | It converts the features to lowercase if selected as True. | The default value is True. |
Advanced | Ngram Minimum Range | It sets the smallest N-gram size used as a feature, where an N-gram is a sequence of N words and N = 1, 2, 3, and so on. | |
Advanced | Ngram Maximum Range | It sets the largest N-gram size used as a feature, where an N-gram is a sequence of N words and N = 1, 2, 3, and so on. | |
Advanced | Stop Words | It allows you to add one or multiple stop words from the standard English set of stop words. Stop words are excluded from the features. | |
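The effect of the Advanced properties can be illustrated with a minimal Python sketch. It is not the product's implementation: the function name `count_features`, its parameters, and the tokenization rule are hypothetical, and it assumes that Ngram Minimum Range and Ngram Maximum Range bound the length of the word sequences counted as features.

```python
import re
from collections import Counter


def count_features(text, lowercase=True, ngram_min=1, ngram_max=1, stop_words=None):
    """Count word n-grams in one text (illustrative sketch only)."""
    if lowercase:
        # Lowercase = True: "Code" and "code" become the same feature.
        text = text.lower()
    # Assumption: a token is a run of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", text)
    if stop_words:
        # Stop words are dropped before n-grams are formed.
        stop = set(stop_words)
        tokens = [t for t in tokens if t not in stop]
    counts = Counter()
    # Count every n-gram whose length lies between ngram_min and ngram_max.
    for n in range(ngram_min, ngram_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical input text, unigrams and bigrams.
example = count_features("Code review improves code quality", ngram_min=1, ngram_max=2)
print(example["code"])         # unigram count
print(example["code review"])  # bigram count
```

With both range values set to 1, only single words are counted; raising the maximum to 2 adds two-word sequences as extra features.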
Consider a dataset with one of the variables as a text variable. A snippet of the input data is shown in the figure given below.
In the Properties pane, the values are selected as shown in the table below.
Property | Value |
---|---|
Text | Text |
Lowercase | True |
Ngram Minimum Range | 1 |
Ngram Maximum Range | 1 |
Stop Words | None |
The first part of the Result of CountVectorizer is shown in the figure below.
The second part of the Result of CountVectorizer is shown in the figure below.
The Result page displays the Sparse Matrix for the selected text variable.
A sparse matrix is a structure that contains as many rows as there are data points and as many columns as there are features. In the matrix,
Key Observations:
In the above example, the count of the feature "code" is 3. Its probability is 0.13636, calculated according to the Ngram Minimum Range and Ngram Maximum Range values entered in the Properties pane. No word is excluded from the feature columns because no stop words are defined in the Properties pane.
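The structure of the result can be sketched in plain Python as a count matrix with one row per data point and one column per feature. This is only an illustration: the helper names and the sample texts are hypothetical, and the definition of probability used here (a feature's total count divided by the total count of all features) is an assumption, since the exact formula is not stated above. Under that assumption, a count of 3 out of a total of 22 would round to 0.13636, but the total of 22 is an inference and does not appear in the example.

```python
from collections import Counter


def build_count_matrix(documents):
    """Return (features, matrix): one row per document, one column per feature."""
    row_counts = [Counter(doc.lower().split()) for doc in documents]
    features = sorted(set().union(*row_counts))
    matrix = [[counts.get(f, 0) for f in features] for counts in row_counts]
    return features, matrix


def feature_shares(features, matrix):
    """Assumed definition: column total divided by the sum of all counts."""
    totals = [sum(column) for column in zip(*matrix)]
    grand_total = sum(totals)
    return {f: t / grand_total for f, t in zip(features, totals)}


# Hypothetical data points, not the dataset shown in the figures.
docs = ["clean code", "code review", "review notes"]
features, matrix = build_count_matrix(docs)
shares = feature_shares(features, matrix)

print(features)
for row in matrix:
    print(row)
print(shares["code"])
```

Most entries in each row are zero because a single text contains only a few of the features, which is why the result is stored and displayed as a sparse matrix.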
Notes:
You can publish the model from the CountVectorizer task node. The published model can be reused in a workbook or workflow for training and experimenting, or used in a workflow for production. For more information on publishing a task, refer to Publishing Models.
A snippet of the text variable is shown in the figure given below.
On the Data page, you can see the
You can see that