Train Test Split

Description

The data is split randomly into train data and test data. A typical split ratio is 70:30 or 80:20 (train:test).

Why to use

To estimate how accurately the model performs on data it has not seen during training.

When to use

The dataset contains a large number of rows.

When not to use

Limited data is available.

Prerequisites


Input

Any dataset, containing any form of data: textual, categorical, date, or numerical.

Output

Dataset split into two parts – Train data and Test data.

Statistical Methods used

  • Confusion Matrix

  • F Score

  • Adjusted R Square

  • R Square

  • Root Mean Square Error
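
As a rough illustration of how the measures listed above are computed on test data, the following sketch uses plain Python. The function names, the sample values, and the choice of the F1 variant of the F score are assumptions made for this example; they are not taken from any particular product or library.

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(actual, predicted):
    """R Square: 1 - (residual sum of squares / total sum of squares)."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_rows, n_features):
    """Adjusted R Square, penalizing the number of features used."""
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_features - 1)

def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for a binary confusion matrix."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def f1_score(tp, fp, fn):
    """F1 score (assumed variant of the F score) from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical regression results on a test set
y_test = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 8.0]
r2 = r_squared(y_test, y_pred)
print(round(rmse(y_test, y_pred), 4))
print(round(r2, 4))
print(round(adjusted_r_squared(r2, n_rows=4, n_features=1), 4))

# Hypothetical classification results on a test set
labels_test = [1, 0, 1, 1, 0, 0]
labels_pred = [1, 0, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_counts(labels_test, labels_pred)
print(tp, fp, fn, tn)
print(round(f1_score(tp, fp, fn), 4))
```

The regression measures apply when the model predicts a number, and the confusion matrix and F score apply when it predicts a class.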

Limitations

If the data is limited, the train and test sets may be too small to represent the problem, so the model may be biased and the performance estimate unreliable.




The train-test split is a technique for evaluating the accuracy of a model. It estimates how well the model's predictions hold up on data that was not used for training. It is appropriate when a large dataset is available and a quick, reasonable estimate of model performance is required.

In this technique, the input dataset is divided into two datasets, train and test. The train dataset, for which the expected output is known, is used to fit (train) the model. The test dataset is held back during training; the model makes predictions on it, and these predictions are compared with the known outputs to evaluate how the model performs on new data.
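
The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a definitive implementation: the helper names `split_rows` and `fit_line`, the synthetic data, the 80:20 ratio, the fixed seed, and the use of a straight-line model with mean absolute error are all assumptions chosen for the example.

```python
import random

def split_rows(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of rows and return (train, test)."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    shuffled = list(rows)                  # copy; the input stays untouched
    random.Random(seed).shuffle(shuffled)  # random but reproducible order
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(points):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic data for illustration only: y = 2x + 1 plus random noise
rng = random.Random(0)
rows = [(x, 2 * x + 1 + rng.gauss(0, 1)) for x in range(100)]

train, test = split_rows(rows, test_ratio=0.2)

# Fit on the train rows only; the test rows are never seen during fitting
slope, intercept = fit_line(train)

# Evaluate on the held-out test rows
errors = [abs(y - (slope * x + intercept)) for x, y in test]
test_mae = sum(errors) / len(errors)

print(len(train), len(test))
print(f"test mean absolute error: {test_mae:.3f}")
```

Fixing the seed makes the split repeatable, which helps when comparing several models on the same train and test rows.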

The train-test split is used when a sufficiently large dataset is available. Both the train set and the test set should be representative of the problem, with enough records to cover its common and uncommon cases. If the dataset is too small, the model may overfit or underfit, and the test set may be too small to give a reliable estimate of performance.
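
One simple way to check whether both sets represent the problem is to compare how often each case appears in the train set and in the test set. The sketch below does this for a hypothetical label column with 90 common and 10 uncommon cases; the helper name `label_shares`, the labels, and the seed are assumptions made for illustration.

```python
import random
from collections import Counter

def label_shares(labels):
    """Return the share of each label as a fraction of all labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical labels: 90 common cases and 10 uncommon ones
labels = ["common"] * 90 + ["rare"] * 10
random.Random(7).shuffle(labels)

# 80:20 split of the shuffled labels
test_labels, train_labels = labels[:20], labels[20:]

train_shares = label_shares(train_labels)
test_shares = label_shares(test_labels)

# Print each label with its share in train and in test
for label in sorted(set(labels)):
    print(label,
          round(train_shares.get(label, 0.0), 2),
          round(test_shares.get(label, 0.0), 2))
```

If the shares differ widely, or an uncommon case is missing from one of the sets, the split may not represent the problem well and a different split or more data may be needed.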
