Random Forest

Random Forest is located under Machine Learning > Classification in the left task pane. Use the drag-and-drop method (or double-click the node) to add the algorithm to the canvas. Click the algorithm to view and select different properties for analysis.

Refer to Properties of Random Forest.

Properties of Random Forest

The available properties of the Random Forest are as shown in the figure below.



The table given below describes the different fields present on the properties pane of Random Forest.

Field

Description

Remark

Run

It allows you to run the node.

Explore

It allows you to explore the successfully executed node.

Vertical Ellipses

The available options are:

  • Run till node
  • Run from node
  • Publish as a model
  • Publish code

Task Name

It is the name of the task selected on the workbook canvas.

You can click the text field to edit or modify the task's name.

Dependent Variable

It allows you to select the dependent variable.

  • Only one feature can be selected.
  • Only categorical values should be selected.

Independent variables

It allows you to select the experimental or predictor variable(s).

  • Multiple data fields can be selected.
  • Only numerical values should be selected.
  • You can also add categorical columns, but you must first perform label encoding.

Advanced

Number of Estimators

It allows you to select the number of base estimators in the ensemble.

  • The number of estimators is the number of trees built by the algorithm.
  • The default value is 100.
  • Only numerical values can be entered.

Criterion

It allows you to select the decision-making criterion to be used.

  • It is a tree-specific parameter.
  • It decides the quality of the split.
  • The available options are:
      • gini (default), which uses Gini impurity
      • entropy, which uses Information Gain
  • For example, suppose you want to select the root node. With entropy, you calculate the entropy for each variable, which gives its information gain; the variable with the maximum information gain becomes the root node.
  • Similarly, with gini, the variable with the minimum Gini index is selected as the root node. (A small sketch of both measures follows this list.)
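To make the two measures concrete, here is a minimal NumPy sketch of Gini impurity and entropy; it illustrates the underlying formulas and is not the product's internal code.

    import numpy as np

    def gini(labels):
        # Gini impurity: 1 - sum(p_i^2); 0 for a pure node, higher is more mixed.
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def entropy(labels):
        # Shannon entropy: -sum(p_i * log2(p_i)); 0 for a pure node.
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    mixed = ["Yes", "No", "Yes", "No"]  # maximally impure two-class node
    pure = ["No", "No", "No", "No"]     # pure node

    print(gini(mixed), entropy(mixed))  # 0.5 1.0
    print(gini(pure), entropy(pure))    # 0.0 -0.0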

Maximum Features

It allows you to select the maximum number of features to be considered for the best split.

  • The available options are:
      • auto (default)
      • sqrt
      • log2
      • none
  • If you select:
      • auto, it uses sqrt by default.
      • sqrt, it takes the square root of the number of independent variables as the maximum features.
      • log2, it takes the base-2 logarithm of the number of independent variables as the maximum features.
      • none, it considers all the independent variables as the maximum features. (See the quick check below.)
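As a quick check of how these options translate into feature counts, here is a small sketch; it assumes the common convention of flooring the value (scikit-learn, for instance, does this), which may differ from the product's exact rule.

    import math

    n_features = 7  # e.g., the seven independent variables in the example below

    print(max(1, int(math.sqrt(n_features))))  # sqrt -> 2 features per split
    print(max(1, int(math.log2(n_features))))  # log2 -> 2 features per split
    print(n_features)                          # none -> all 7 features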

Random State

It allows you to set the seed for the random train-test split used by the classifier.

  • The data is split randomly into train and test sets.
  • Different seed values produce different train-test combinations for training and testing the model.
  • A well-generalized model is expected to return similar results even with different train-test combinations.
  • Fixing the random state ensures that the obtained results can be reproduced (see the sketch below).
  • You can enter any integer value as the random state.
  • This parameter is optional.
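Here is a minimal sketch of the reproducibility idea on synthetic data, assuming a scikit-learn-style backend; the product's internals may differ.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, random_state=0)

    # The same seed reproduces the same split (and the same forest) on every
    # run; changing the seed yields a different train-test combination.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=3
    )
    model = RandomForestClassifier(n_estimators=100, random_state=3)
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))  # identical on every rerun with the same seed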

Maximum Depth

It allows you to set the maximum depth of each decision tree.

  • Deeper trees can capture more detail and often improve the prediction.
  • However, more depth also takes more time and computation power, and very deep trees may overfit.
  • So, it is advisable to choose an optimum depth.

Feature Selection Percentage

It is used to decide which features are displayed based on their feature importance.

  • It decides the features that are displayed on the Data tab.
  • Only those features for which the cumulative sum of feature importance is less than or equal to the specified value are displayed (see the sketch below).
  • The default value is 100, which is equivalent to a fraction of 1.0.
  • At a fraction of 1.0, all features are considered important and displayed.
  • You can select any numerical value between 1 and 100.
  • The displayed features are the most important ones for deciding the success of the classification model compared to the other features.
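A minimal sketch of how such a cumulative cut-off can work, based on the description above; the importance values here are illustrative, not taken from the product.

    importances = {"DailyRate": 0.35, "Age": 0.22, "Education": 0.18,
                   "JobSatisfaction": 0.15, "WorkLifeBalance": 0.10}

    threshold = 0.6  # a Feature Selection Percentage of 60 -> 0.6
    kept, running_total = [], 0.0
    for name, imp in sorted(importances.items(), key=lambda kv: -kv[1]):
        if running_total + imp <= threshold:
            kept.append(name)
            running_total += imp

    print(kept)  # ['DailyRate', 'Age'] -> cumulative importance 0.57 <= 0.6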

Dimensionality Reduction

It allows you to select the dimensionality reduction technique.

  • The default value is None.
  • Only one technique can be selected.
  • The available options are:
      • None
      • PCA
  • Principal Component Analysis (PCA) maps the data linearly to a lower-dimensional space so as to maximize the variance of the data in the low-dimensional representation (see the sketch below).
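To give a rough picture of what the PCA option does, here is a minimal scikit-learn sketch on synthetic data; it illustrates the technique itself, not necessarily how the product applies it internally.

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA

    X, _ = make_classification(n_samples=200, n_features=7, random_state=0)

    pca = PCA(n_components=2)    # project the 7 features onto 2 components
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                      # (200, 2)
    print(pca.explained_variance_ratio_.sum())  # fraction of variance retained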

Add Result as a Variable

It allows you to select any of the result parameters as the variable.

  • You can select from the following performance parameters:
      • Accuracy
      • Sensitivity
      • Specificity
      • F-score

Node Configuration

It allows you to select the instance of the AWS server to control the execution of a task in a workbook or workflow.

For more details, refer to Worker Node Configuration.

Hyperparameter Optimization

It allows you to select parameters for Hyperparameter Optimization.

For more details, refer to Hyperparameter Optimization.

Example of Random Forest

Consider an HR dataset with over 1400 rows and 30 columns. It contains features such as Attrition, BusinessTravel, DailyRate, PercentSalaryHike, PerformanceRating, and so on. The dataset can be used to study the impact of multiple factors on employee attrition.
A snippet of input data is shown in the figure given below.


In the Properties pane, the following values are selected.

  • Dependent Variable: Attrition
  • Independent Variables: Age, DailyRate, Education, JobSatisfaction, PercentSalaryHike, StockOptionLevel, WorkLifeBalance
  • No. of Estimators: 100
  • Criterion: gini
  • Maximum Features: auto
  • Random State: 3
  • Maximum Depth: 10
  • Feature Selection Percentage: 60
  • Dimensionality Reduction: None
  • Add Result as a Variable: Accuracy, Sensitivity, Specificity, FScore


Notes:

  • The Feature Selection Percentage selected for configuration is 60, which is equivalent to a value of 0.6. It means that only those features for which the sum of feature importance is less than or equal to 0.6 are displayed in the Data tab.
  • These features are more important for deciding the success of the classification model compared to other features.
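For readers who prefer code, the following is a minimal sketch of an equivalent configuration, assuming a scikit-learn-style backend; the file name hr_attrition.csv is hypothetical, and the product's internal implementation may differ.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("hr_attrition.csv")  # hypothetical file name for the HR dataset

    features = ["Age", "DailyRate", "Education", "JobSatisfaction",
                "PercentSalaryHike", "StockOptionLevel", "WorkLifeBalance"]
    X, y = df[features], df["Attrition"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

    model = RandomForestClassifier(
        n_estimators=100,     # No. of Estimators
        criterion="gini",     # Criterion
        max_features="sqrt",  # "auto" resolves to sqrt for classification
        random_state=3,       # Random State
        max_depth=10,         # Maximum Depth
    )
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))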


Since we select Accuracy, Sensitivity, Specificity, and FScore as the performance metrics, a variable is created for each metric for each of the two Events of Interest, "Yes" and "No," giving eight new variables. The value of accuracy remains the same for both events.
For example, Random_Forest_Accuracy_No and Random_Forest_Accuracy_Yes are the variables created corresponding to the Accuracy metric for the events "No" and "Yes."


Note:

As you can see, the Default and Current values for each variable are the same, and they are also the values for the performance metrics displayed on the Result page.


The Result page for the Event of Interest "No" is shown below.


The Result page for the Event of Interest "Yes" is shown below.


The Result page displays:

  • The performance metrics Accuracy, FScore, Sensitivity, and Specificity for the two Events of Interest.
      • You can see that the accuracy for the two events is constant at 0.9313.
      • The three remaining metrics have different values for the two events.
  • The Confusion Matrix for the predicted and actual values of the two Events of Interest.
      • The shaded diagonal cells show the correctly predicted categories. For example, 1233 "No" and 136 "Yes" Attrition values are correctly predicted.
      • The remaining cells indicate the wrongly predicted categories. For example, 101 "Yes" Attrition values are wrongly predicted as "No." (A worked check of these counts follows this list.)
  • The Receiver Operating Characteristics (ROC) charts for the two Events of Interest.
      • You can see that the ROC curve is identical for both events.
      • The Area Under Curve (AUC) for the ROC chart is 0.9906.
      • Since the value is high (close to 1), the Random Forest model is meaningful and clearly distinguishes between the two classes or events of interest (of Attrition).
  • The Lift Charts for the two Events of Interest.
      • You can see that the Lift Charts are different for the two events.
      • The area of the region between the lift curve (blue) and the baseline (red) is different for the two events.
      • Since the area for the Event of Interest "Yes" is larger, we can conclude that the Random Forest algorithm classifies this event more clearly.
  • The Feature Importance of the selected independent variables.
      • The feature importance is expressed as a decimal number.
      • The features are arranged in descending order of their importance.
      • Thus, DailyRate is the most important and WorkLifeBalance the least important feature for deciding employee attrition.
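To connect the confusion-matrix counts to the reported metrics, here is a short worked check for the Event of Interest "No." It assumes the remaining cell ("No" predicted as "Yes") is 0, which is consistent with the stated accuracy of 0.9313 on 1470 rows (1233 + 136 + 101); the sensitivity, specificity, and F-score values derived here are illustrative arithmetic, not figures quoted from the Result page.

    tp = 1233  # "No" correctly predicted as "No"
    tn = 136   # "Yes" correctly predicted as "Yes"
    fp = 101   # "Yes" wrongly predicted as "No"
    fn = 0     # "No" predicted as "Yes" (assumed; see above)

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)

    print(round(accuracy, 4))     # 0.9313 -> matches the Result page
    print(round(sensitivity, 4))  # 1.0
    print(round(specificity, 4))  # 0.5738
    print(round(f_score, 4))      # 0.9607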

The Data page displays

  • One additional Label column
  • The dependent variable column (Attrition)
  • The maximum importance features, that is, features for which the sum of feature importance is less than or equal to 0.6 (Age and DailyRate)
  • You can compare the corresponding values in the Label and Attrition columns and observe the correctly and incorrectly predicted values

    • Related Articles

    • Random Forest Regression
    • Isolation Forest
    • Classification
    • Decision Tree
    • Decision Tree Regression