Cross Validation is located in the left task pane of Model Studio (), under Data Preparation, in Sampling. Drag and drop the algorithm onto the canvas, then click it to view and select the properties for your analysis.
Refer to Properties of Cross Validation.
Cross Validation is a resampling technique used to evaluate the accuracy of a model. It is mainly used when the available input data is limited.
The dataset is divided into k groups, called folds. Hence, the process is generally called k-fold cross-validation.
Consider that k = 5. The dataset is divided into five equal parts, and the steps mentioned below are executed five times, each time with a different holdout set.
In each iteration, one fold is held out as test data, and the remaining folds are used as training data.
In the first iteration, the data is split into five folds: fold one is the test data, and the other four folds are the training data.
This is repeated until each of the five folds has been used as test data exactly once.
Because this technique iterates over the folds, data used for training in one iteration is used for testing in a later iteration.
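The iteration described above can be sketched in plain Python. This is an illustration of the k-fold idea, not the tool's internal implementation; the function names are ours.

```python
# Sketch of k-fold cross-validation over 10 sample records.
# Each fold serves as the holdout (test) set exactly once; the
# remaining folds form the training set for that iteration.

def k_fold_indices(n_records, k):
    """Split record indices 0..n_records-1 into k near-equal contiguous folds."""
    fold_size, remainder = divmod(n_records, k)
    folds, start = [], 0
    for i in range(k):
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(list(range(start, end)))
        start = end
    return folds

def cross_validation_splits(n_records, k):
    """Yield (train_indices, test_indices) for each of the k iterations."""
    folds = k_fold_indices(n_records, k)
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for iteration, (train, test) in enumerate(cross_validation_splits(10, 5), start=1):
    print(f"Iteration {iteration}: train={train} test={test}")
```

Note that every record appears in exactly one test set across the five iterations, which is what distinguishes cross-validation from a single train/test split.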
The table below describes the different fields on the properties pane of Cross Validation.
Field | Description | Remark |
---|---|---|
Run | It allows you to run the node. | - |
Explore | It allows you to explore the successfully executed node. | - |
Vertical Ellipses | The available options are | - |
Task Name | It is the task selected on the workbook canvas. | You can click the text field to edit or modify the name of the task as required. |
Number Of Folds | It allows the dataset to be split into the given number of folds. | - |
Shuffle | It allows you to select whether or not to shuffle the input data while creating the different folds. | Its values are either True or False. True: The data is shuffled before splitting into folds. False: The data is not shuffled before splitting into folds. |
Random Seed | It is the value that seeds the random number generator. This ensures that the data is split in the same pattern every time the node is re-run. | - |
Cross Validation can be applied to any dataset. You can apply any of the Classification or Regression models to the output obtained from Cross Validation.
The value of k should be chosen carefully so that each fold remains representative of the data; k = 5 and k = 10 are common choices.
Consider a flower dataset with 150 records. A snippet of input data is shown in the figure given below.
We apply Cross Validation to the input data. The output of Cross Validation is given as input to the Regression model, Ridge Regression.
The result displays the Regression Statistics for each of the folds, as shown in the figure below.
The final score for each of the different metrics on complete data is also displayed.
The result also displays Fold-wise Cross Validation (CV) Score, Standard Deviation, and Mean Score of all the CV scores.
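The fold-wise summary described above (per-fold CV score, mean, and standard deviation) can be sketched in plain Python. A simple mean-value predictor stands in for Ridge Regression here to keep the sketch dependency-free; the sample values and scoring choice (negative mean squared error) are illustrative assumptions.

```python
import statistics

def fold_score(train_y, test_y):
    """Score one fold: negative mean squared error of predicting the training mean."""
    prediction = statistics.mean(train_y)  # stand-in for a fitted Ridge model
    return -statistics.mean((y - prediction) ** 2 for y in test_y)

# Illustrative target values, e.g. sepal lengths from a flower dataset.
y = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]
k = 5
fold_size = len(y) // k

scores = []
for i in range(k):
    test = y[i * fold_size:(i + 1) * fold_size]          # holdout fold
    train = y[:i * fold_size] + y[(i + 1) * fold_size:]  # remaining folds
    scores.append(fold_score(train, test))

print("Fold-wise CV scores:", scores)
print("Mean score:", statistics.mean(scores))
print("Standard deviation:", statistics.stdev(scores))
```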
Similarly, you can use Train Test Split to test the performance of any other Classification or Regression model.
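For contrast with cross-validation, a single train/test split holds out a fixed fraction of records once rather than iterating over folds. The sketch below illustrates that idea; the function signature and parameter names are ours, not the Train Test Split node's.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=0):
    """Hold out test_fraction of the records once, after a seeded shuffle."""
    records = list(records)
    random.Random(seed).shuffle(records)
    n_test = int(len(records) * test_fraction)
    return records[n_test:], records[:n_test]  # (train, test)

# e.g. a 150-record flower dataset split 80/20
train, test = train_test_split(range(150), test_fraction=0.2, seed=1)
print(len(train), len(test))  # 120 30
```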