Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is located under Textual Analysis > Topic Modeling > Latent Dirichlet Allocation. Drag and drop the algorithm onto the canvas to use it. Click the algorithm to view and select its modeling properties.



Properties of Latent Dirichlet Allocation



The following table describes the properties of Latent Dirichlet Allocation.

Field

Description

Remark

Run

It allows you to run the node.

-

Explore

It allows you to explore the successfully executed node.

-

Vertical Ellipses

The available options are:

  • Run till node
  • Run from node
  • Publish as a model
  • Publish code

-

Task Name

It is the name of the task selected on the workbook canvas.

  • Click the text field to edit the name of the task, as required.
  • Spaces are not allowed in the Task Name.

Corpus

A corpus is a large, structured collection of text. This field lists the categorical and text columns present in the dataset.

You can select only one variable.

Number of Topics

Enter the required number of topics to be extracted from the corpus.

The default value is 5.

Advanced



Coherence Method

It evaluates the quality and interpretability of topics.

  • The available methods are:
    • c_v
    • u_mass
  • The default method is c_v.

Topic Range

You can specify the number of topics that the model can discover and represent.

The default value is 10.

Chunk Size

It represents the number of documents to be used in each training chunk.

The default value is 2000.

Passes

It refers to the number of times the entire corpus is processed during training.

The default value is 1.

Iterations

It specifies the maximum number of iterations allowed for each pass.

The default value is 50.

Random State

Enter a seed value to control the random number generator used for initializing the model. Using the same value makes results reproducible across runs.

-

Alpha

It controls the sparsity of the topic distribution within each document.

The default value is alpha='symmetric', which means every topic is given the same prior weight in each document.

Gamma Threshold

It allows you to control the threshold for the topic.

The default value is 0.0001, which means topics with a probability less than 0.001 are not assigned to words in the corpus.

Decay

It controls how quickly the learning rate decreases during online learning.

The default value is 0.5. Higher values make the learning rate decrease faster as more chunks are processed.

Minimum Probability

It filters out the topics with probabilities lower than the assigned value.

The default value is 0.01, which means topics with probabilities less than 0.01 are filtered out.
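The property names and defaults in the table closely resemble the keyword arguments of gensim's LdaModel. The following is a minimal sketch of how the same settings could be reproduced in Python, assuming a gensim-based setup; the mapping, the LDA_DEFAULTS dictionary, and the helper functions train_lda and topic_coherence are illustrative assumptions, not the product's documented implementation.

```python
# Property names from the table mapped to gensim.models.LdaModel keyword
# arguments. This mapping is an assumption made for illustration.
LDA_DEFAULTS = {
    "num_topics": 5,              # Number of Topics
    "chunksize": 2000,            # Chunk Size (documents per training chunk)
    "passes": 1,                  # Passes
    "iterations": 50,             # Iterations
    "random_state": None,         # Random State (no default listed in the table)
    "alpha": "symmetric",         # Alpha
    "gamma_threshold": 0.0001,    # Gamma Threshold as listed; gensim's own default is 0.001
    "decay": 0.5,                 # Decay
    "minimum_probability": 0.01,  # Minimum Probability
}


def train_lda(tokenized_docs, **overrides):
    """Train an LDA model on a list of token lists (hypothetical helper)."""
    # Imported lazily so the defaults above can be inspected without gensim.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    params = {**LDA_DEFAULTS, **overrides}
    dictionary = Dictionary(tokenized_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    model = LdaModel(corpus=bow_corpus, id2word=dictionary, **params)
    return model, dictionary


def topic_coherence(model, tokenized_docs, dictionary, method="c_v"):
    """Score a trained model with the selected Coherence Method (c_v or u_mass)."""
    from gensim.models import CoherenceModel

    scorer = CoherenceModel(model=model, texts=tokenized_docs,
                            dictionary=dictionary, coherence=method)
    return scorer.get_coherence()
```

For example, train_lda(docs, num_topics=8, random_state=42) would override two of the defaults while keeping the rest, where docs is a list of tokenized documents.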

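To make the Decay property more concrete, the sketch below computes the step size used in the online LDA schedule of Hoffman et al., where the learning rate after t chunk updates is (offset + t) raised to the power of -decay. The offset parameter is not listed in the table and is assumed to be 1.0 here; the product's exact schedule may differ.

```python
def learning_rate(update_index, decay=0.5, offset=1.0):
    """Step size (offset + t) ** (-decay) for the t-th chunk update."""
    return (offset + update_index) ** (-decay)


# Learning rate for the first five chunk updates with the default decay of 0.5.
rates = [round(learning_rate(t), 3) for t in range(5)]
print(rates)  # [1.0, 0.707, 0.577, 0.5, 0.447]
```

With decay set to 0.5 the rate falls off gradually rather than halving at every chunk; a value closer to 1 makes later chunks influence the model less.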
Example of Latent Dirichlet Allocation

LDA (Latent Dirichlet Allocation) is a popular method for topic modeling. In this example, we apply LDA to the BBC News dataset. Before connecting LDA to the dataset, we prepare the data using various data preparation algorithms and build the workflow. Refer to the workflow shown below:



In the Properties pane, the following values were selected:


After the successful execution of the algorithm, we obtain the following result:


The result page displays:

  • Coherence Scores vs. Number of Topics
  • List of Assigned Topics
  • Intertopic Distance Map
  • Top-30 Most Salient Terms
  • λ Value Table
  • WordCloud Chart
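The Intertopic Distance Map, salient terms, and λ value appear to follow the conventions of LDAvis-style visualizations. Assuming that is the case, λ weights how terms are ranked within a topic: a value of 1 ranks terms by their probability in the topic, while a value of 0 ranks them by how specific they are to the topic. The sketch below illustrates this relevance formula; the function names and the probabilities are made up for illustration and are not taken from the BBC News results.

```python
import math


def relevance(p_term_given_topic, p_term, lam):
    """LDAvis-style relevance: lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)


def rank_terms(topic_probs, corpus_probs, lam):
    """Return the terms of one topic ordered from most to least relevant."""
    return sorted(
        topic_probs,
        key=lambda term: relevance(topic_probs[term], corpus_probs[term], lam),
        reverse=True,
    )


# Hypothetical probabilities for three terms in one topic and in the whole corpus.
topic_probs = {"game": 0.10, "said": 0.12, "striker": 0.03}
corpus_probs = {"game": 0.02, "said": 0.10, "striker": 0.003}

print(rank_terms(topic_probs, corpus_probs, lam=1.0))  # frequent terms first
print(rank_terms(topic_probs, corpus_probs, lam=0.0))  # topic-specific terms first
```

Lower λ values push common words such as "said" down the list and bring rarer, more distinctive terms to the top.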
