Cross Validation

| Field | Detail |
| --- | --- |
| Description | The dataset is randomly divided into a number of groups called folds (k folds in total). In each iteration, one fold is held out as test data and the remaining folds are used as training data. This is repeated until each fold has been used as test data once. |
| Why to use | |
| When to use | |
| When not to use | A sufficient amount of data is available to train the model. |
| Prerequisites | |
| Input | A dataset containing any form of data: textual, categorical, date, or numerical. |
| Output | The dataset split into k folds. |
| Statistical Methods used | |
| Limitations | |
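The Input-to-Output step in the table (splitting a dataset into k folds) can be sketched as below. This is a minimal illustration using only the Python standard library; the function name `make_folds` and the fixed seed are assumptions for the example, not part of any particular library.

```python
import random

def make_folds(data, k=5, seed=0):
    """Shuffle the dataset and split it into k roughly equal folds."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    # Assign every k-th shuffled row to the same fold, so fold sizes
    # differ by at most one element.
    return [rows[i::k] for i in range(k)]

dataset = list(range(10))       # stand-in for rows of any data type
folds = make_folds(dataset, k=5)
print([len(f) for f in folds])  # -> [2, 2, 2, 2, 2]
```

Shuffling before splitting matters: if the dataset is ordered (for example, by class label or date), contiguous folds would not be representative of the whole.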
Cross-validation is a resampling technique used to evaluate the performance of a model. It is used mainly when the available input data is limited.
The dataset is divided into a number of groups called folds; with k folds, the process is generally called k-fold cross-validation.
Consider k = 5: the dataset is divided into five equal parts. After division, the steps below are executed five times, each time with a different holdout set.
In each iteration, one fold is held out as test data, and the remaining k − 1 folds are used as training data.
In the first iteration, the data is split into five folds: fold one is the test data, and the other four are the training data.
This is repeated until each of the five folds has been used as test data exactly once.
Because the technique iterates over the folds, data used for training in one iteration serves as test data in another, so every observation is used for both training and evaluation.
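The full procedure described above can be sketched as follows. This is an illustrative sketch, not a library implementation: the function names `k_fold_scores` and `mean_model` are hypothetical, and the "model" here simply predicts the mean of its training values so the loop stays self-contained.

```python
import random

def k_fold_scores(data, k, train_and_score):
    """Run k-fold cross-validation: each fold is the test set exactly once."""
    rows = list(data)
    random.Random(42).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]  # the held-out fold for this iteration
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(train_and_score(train, test))
    return scores

# Hypothetical stand-in model: predict the mean of the training values,
# scored by negative mean absolute error on the held-out fold.
def mean_model(train, test):
    pred = sum(train) / len(train)
    return -sum(abs(y - pred) for y in test) / len(test)

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
scores = k_fold_scores(data, k=5, train_and_score=mean_model)
print(sum(scores) / len(scores))  # average score across the 5 iterations
```

Averaging the k per-fold scores gives a single performance estimate that uses every observation for both training and evaluation, which is the point of the technique when data is limited.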