Cross Validation

| Field | Detail |
| --- | --- |
| Description | The dataset is randomly divided into a number of groups called folds (k folds in total). In each iteration, one fold is held out as test data and the remaining folds are used as training data. This is repeated until each fold has been used as test data once. |
| Why to use | |
| When to use | |
| When not to use | A sufficient amount of data is available to train the model. |
| Prerequisites | |
| Input | A dataset containing any form of data: textual, categorical, date, or numerical. |
| Output | The dataset split into k folds. |
| Statistical Methods used | |
| Limitations | |
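The Input-to-Output step in the table (splitting a dataset into k folds) can be sketched as below. This is a minimal illustration using only the Python standard library; the function name `make_folds` and the fixed seed are assumptions for the example, not part of any particular library.

```python
import random

def make_folds(data, k=5, seed=0):
    """Shuffle the dataset and split it into k roughly equal folds."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    # Assign every k-th shuffled row to the same fold, so fold sizes
    # differ by at most one element.
    return [rows[i::k] for i in range(k)]

dataset = list(range(10))       # stand-in for rows of any data type
folds = make_folds(dataset, k=5)
print([len(f) for f in folds])  # -> [2, 2, 2, 2, 2]
```

Shuffling before splitting matters: if the dataset is ordered (for example, by class label or date), contiguous folds would not be representative of the whole.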
Cross-validation is a resampling technique used to evaluate the performance of a model. It is used mainly when the available input data is limited.
The dataset is divided into a number of groups called folds; with k folds, the process is generally called k-fold cross-validation.
Consider k = 5: the dataset is divided into five equal parts. After division, the steps below are executed five times, each time with a different holdout set.
In each iteration, one fold is held out as test data, and the remaining k − 1 folds are used as training data.
In the first iteration, the data is split into five folds: fold one is the test data, and the other four are the training data.
This is repeated until each of the five folds has been used as test data exactly once.
Because the technique iterates over the folds, data used for training in one iteration serves as test data in another, so every observation is used for both training and evaluation.
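The full procedure described above can be sketched as follows. This is an illustrative sketch, not a library implementation: the function names `k_fold_scores` and `mean_model` are hypothetical, and the "model" here simply predicts the mean of its training values so the loop stays self-contained.

```python
import random

def k_fold_scores(data, k, train_and_score):
    """Run k-fold cross-validation: each fold is the test set exactly once."""
    rows = list(data)
    random.Random(42).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]  # the held-out fold for this iteration
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(train_and_score(train, test))
    return scores

# Hypothetical stand-in model: predict the mean of the training values,
# scored by negative mean absolute error on the held-out fold.
def mean_model(train, test):
    pred = sum(train) / len(train)
    return -sum(abs(y - pred) for y in test) / len(test)

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
scores = k_fold_scores(data, k=5, train_and_score=mean_model)
print(sum(scores) / len(scores))  # average score across the 5 iterations
```

Averaging the k per-fold scores gives a single performance estimate that uses every observation for both training and evaluation, which is the point of the technique when data is limited.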