Train Test Split | |||
---|---|---|---|
Description | The data is randomly split into train data and test data. A common choice is a train-to-test ratio of 70:30 or 80:20. | ||
Why to use | To evaluate the accuracy of the model on data it has not seen during training. | ||
When to use | The dataset contains a large number of rows. | When not to use | Limited data is available. |
Prerequisites | |||
Input | Any dataset that contains any form of data – Textual, Categorical, Date, Numerical data. | Output | Dataset split into two parts – Train data and Test data. |
Statistical Methods used | | Limitations | If the data is limited, then there is a possibility of high bias. |
The train-test split is a technique for evaluating the accuracy of a model. It is suited to large datasets and is appropriate when a quick, reasonably good estimate of model performance is required.
In this technique, the input dataset is divided into two datasets, train and test. The train dataset, for which the expected output is known, is used to fit the model. The test dataset is held back during training. The model makes predictions on it, and comparing those predictions with the known outputs shows how the model performs on new data.
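The split step described above can be sketched in a few lines of plain Python. This is only an illustration, not the product's implementation: the function name `train_test_split_rows`, its `test_ratio` and `seed` parameters, and the sample dataset are all hypothetical.

```python
import random

def train_test_split_rows(rows, test_ratio=0.2, seed=42):
    """Randomly split rows into (train, test) lists; test_ratio is the test share."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle keeps the split reproducible
    n_test = round(len(shuffled) * test_ratio)
    # The first n_test shuffled rows become the test set, the rest the train set
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset: 100 (feature, label) pairs
dataset = [(x, x % 2) for x in range(100)]
train, test = train_test_split_rows(dataset, test_ratio=0.2)
print(len(train), len(test))  # 80 20
```

Passing `test_ratio=0.3` would give the 70:30 split instead. The model is then fitted on `train` only, and `test` is used solely for evaluation.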
The train-test split is used when a sufficiently large dataset is available. Both the train set and the test set should be representative of the problem, with enough records to cover its common and uncommon cases. If the dataset is too small, the model may overfit or underfit, and the performance estimate becomes unreliable.
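One rough way to check whether both sets represent the problem is to compare how often each case occurs in the full dataset, the train set, and the test set. The sketch below assumes a labelled dataset. The helper `label_shares` and the imbalanced sample data are hypothetical and serve only as an illustration.

```python
import random
from collections import Counter

def label_shares(rows):
    """Return each label's share of the rows as a dict of label -> fraction."""
    counts = Counter(label for _, label in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Hypothetical imbalanced dataset: 90 common cases and 10 uncommon ones
dataset = [(i, "common") for i in range(90)] + [(i, "rare") for i in range(90, 100)]

# Plain random 80:20 split with a fixed seed
shuffled = list(dataset)
random.Random(0).shuffle(shuffled)
test, train = shuffled[:20], shuffled[20:]

full, tr, te = label_shares(dataset), label_shares(train), label_shares(test)
for label in full:
    # A label that is missing from a split is reported with a share of 0.0
    print(label, full[label], tr.get(label, 0.0), te.get(label, 0.0))
```

If a label's share in the test set is far from its share in the full dataset, or the label is missing entirely, the test score says little about that case. This is more likely when the dataset is small.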