Text Vectorization

Natural Language Processing (NLP) algorithms operate on numbers, so text must be transformed into numeric form before a machine can analyze it and extract useful information from it. This process of converting strings or text into a meaningful array of real numbers (a vector) is called vectorization.

Text vectorization maps each word or phrase from a vocabulary to a corresponding real number or vector. These numeric representations can then be used to predict words and to measure similarities between words or texts.
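As an illustration of this mapping, the following is a minimal sketch in plain Python that builds a vocabulary from a small set of texts and turns each text into a fixed-length vector of word counts. The function names (tokenize, build_vocabulary, count_vector) and the sample texts are hypothetical and chosen only for this example; they are not part of Rubiscape.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Map each unique word to a fixed column index (sorted for stability)."""
    words = sorted({token for doc in documents for token in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def count_vector(text, vocabulary):
    """Return a fixed-length vector of word counts; unknown words are ignored."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector


# Hypothetical sample texts, used only for illustration.
documents = ["the cat sat on the mat", "the dog sat"]
vocabulary = build_vocabulary(documents)
vectors = [count_vector(doc, vocabulary) for doc in documents]

print(vocabulary)  # {'cat': 0, 'dog': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print(vectors)     # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each position in a vector corresponds to one word of the vocabulary, so every text is represented by a vector of the same length regardless of how many words it contains.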

Text vectorization in NLP helps to perform the following textual analysis tasks:

  • Extract features for text classification.
  • Compute the occurrence of similar words.
  • Compute the probability of occurrence of similar words.
  • Compute the relevance of features in a text.
  • Predict the next words in a sequence of words.
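Once texts are vectors, tasks such as measuring similarity reduce to arithmetic on those vectors. The sketch below uses cosine similarity, a common measure that is assumed here purely for illustration; the source does not state which similarity measure Rubiscape applies. The two input vectors are hypothetical word-count vectors over a shared six-word vocabulary.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # An all-zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical count vectors for two short texts.
doc_a = [1, 0, 1, 1, 1, 2]
doc_b = [0, 1, 0, 0, 1, 1]

similarity = cosine_similarity(doc_a, doc_b)
print(round(similarity, 4))
```

A value close to 1 indicates that the two texts use the same words in similar proportions, and a value of 0 indicates that they share no words.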

In Rubiscape, two Text Vectorization algorithms are available:

  • CountVectorizer
  • TF-IDF (Term Frequency-Inverse Document Frequency)
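To show how TF-IDF differs from plain counting, here is a minimal sketch in plain Python. It assumes the textbook definitions, where term frequency is the word count divided by the text length and inverse document frequency is log(N / df); several variants of these formulas exist, and the source does not state which one Rubiscape uses. The function name tf_idf and the sample texts are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return (vocabulary, matrix) with one row of TF-IDF scores per text.

    Assumes tf = count / text length and idf = log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)

    # Document frequency: the number of texts in which each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = []
        for word in vocabulary:
            tf = counts[word] / len(tokens) if tokens else 0.0
            idf = math.log(n_docs / doc_freq[word])
            row.append(tf * idf)
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical sample texts, used only for illustration.
documents = ["the cat sat", "the dog sat", "the cat ran"]
vocabulary, matrix = tf_idf(documents)

for doc, row in zip(documents, matrix):
    print(doc, [round(score, 3) for score in row])
```

Under these assumptions, a word that appears in every text (here "the") scores 0, while a word that appears in only one text (such as "dog") receives the highest weight in that text. This is the sense in which TF-IDF reflects the relevance of a feature rather than only its frequency.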

In the task pane, click Textual Analysis, and then click Text Vectorization.

For more information, refer to Text Vectorization
