Cross Validation

Description

The dataset is divided randomly into a number of groups called k folds. In each iteration, one fold is held out as test data, and the remaining k − 1 folds are used as training data. This is repeated until each fold has served once as the test data.
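The procedure can be sketched in a few lines of Python. Below is a minimal illustration assuming scikit-learn; the iris dataset, logistic regression model, and k = 5 are chosen purely for demonstration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)  # k = 5 folds

scores = []
for train_idx, test_idx in kf.split(X):
    # k - 1 folds train the model; the held-out fold evaluates it.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print("Per-fold accuracy:", np.round(scores, 3))
print("Mean accuracy:", np.mean(scores))
```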

Why use

  • It evaluates the accuracy of the model on data it has not seen during training.
  • It gives a less biased, less optimistic estimate of model performance than a single train/test split (see the sketch after this list).
  • It helps guard against overfitting to the training dataset.
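As a shorthand for the manual loop shown earlier, scikit-learn's cross_val_score runs the same procedure and returns one score per fold; reporting the mean and spread across folds gives the less optimistic estimate described above. Again, the library, dataset, and model are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# One accuracy per fold; the spread shows how sensitive the model is
# to the particular split, which a single train/test split hides.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Fold accuracies:", scores)
print("Estimate: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```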

When to use

  • Input data is limited.

  • Models that are built or tuned iteratively, where each change needs a reliable performance estimate.

When not to use

When a sufficient amount of data is available to train the model and to hold out a separate test set.

Prerequisites


Input

A dataset that contains any form of data – Textual, Categorical, Date, or Numerical.

Output

The dataset split into k folds.

Statistical Methods used

  • Confusion Matrix

  • F Score

  • Adjusted R Square

  • R Square
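The listed metrics can be combined with cross validation as sketched below, assuming scikit-learn: the confusion matrix and F score apply to classification, while R Square (and Adjusted R Square derived from it) apply to regression. The datasets, models, and adjustment formula are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Classification metrics: pool the out-of-fold predictions, then score them.
Xc, yc = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pred = cross_val_predict(clf, Xc, yc, cv=5)
print("Confusion matrix:\n", confusion_matrix(yc, pred))
print("F score:", f1_score(yc, pred))

# Regression metrics: one R Square per fold.
Xr, yr = load_diabetes(return_X_y=True)
r2 = cross_val_score(LinearRegression(), Xr, yr, cv=5, scoring="r2")
print("R Square per fold:", r2)

# Adjusted R Square from the mean R Square, using the common adjustment
# 1 - (1 - R^2)(n - 1)/(n - p - 1) with n samples and p features (assumed).
n, p = Xr.shape
print("Adjusted R Square:", 1 - (1 - r2.mean()) * (n - 1) / (n - p - 1))
```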

Limitations

  • As the model is re-trained from scratch k times, evaluation takes roughly k times the training time of a single model.
  • Computationally expensive for large datasets or complex models.
  • Standard k-fold cross validation does not work on sequential data (such as time series), because random shuffling destroys the temporal order (see the sketch below).
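For sequential data, the usual workaround is a forward-chaining splitter that never trains on observations from the future. A minimal sketch, assuming scikit-learn's TimeSeriesSplit and a toy time-ordered dataset:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Training indices always precede test indices; nothing is shuffled.
    print("train:", train_idx, "test:", test_idx)
```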

Cross Validation is a resampling technique used to evaluate the accuracy of the model. It is used mainly when the available input data is limited.

The dataset is divided into a number of groups called k folds. Hence, the process is generally called k-fold cross-validation.

Consider k = 5. This means the dataset is divided into five equal parts, and the train/test cycle described below is executed five times, each time with a different holdout fold. The procedure begins as follows:

  • The dataset is shuffled randomly.
  • The dataset is split into five folds (k = 5).

Then, in each iteration, one fold is held out as test data while the remaining folds are used as training data.

In the first iteration, fold one is the holdout (test) data, and the other four folds are the training data.

This is repeated, with a different holdout fold each time, until each of the five folds has been used once as test data.

Because the technique iterates over the folds, data used for training in one iteration is used for testing in another.
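The five iterations can be made concrete by printing the fold membership for a small dataset. The KFold helper and the 10-sample dataset below are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # 10 samples, shuffled and split into 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Iteration {i}: train={train_idx}, test={test_idx}")
# Across the five iterations, every sample index appears exactly once
# in a test fold, matching the description above.
```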
