Outlier Detection
Description | Outlier Detection reveals the extreme values that deviate from the rest of the data in a real-world dataset.
Why to use | Numerical Analysis – Data Preparation
When to use | When there are certain values in the data which significantly deviate from the rest of the data.
When not to use | On textual data.
Prerequisites | It should be used on numerical data.
Input | Dataset with extreme values.
Output | Dataset with extreme values either removed or imputed with mean, median, or mode.
Statistical Methods used |
Limitations | -
An outlier is a data value that is unlike the rest of the data. It is rare, or distinct, and does not fit in with the rest of the data.
There are many ways data can end up with outliers.
Many algorithms, including estimators in scikit-learn, can give misleading results if outliers are present in the data. This is because these estimators assume that all values fall within a particular range. It is therefore recommended to use the Outlier Detection method to identify these rare and extreme values. Detecting and correcting outliers helps to generate a more uniform dataset.
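The effect of a single extreme value can be illustrated with a small sketch. The sample below is hypothetical and uses only Python's standard `statistics` module; it is not part of rubiscape. It compares the mean and the median of the same readings with and without one outlier.

```python
import statistics

# Hypothetical sample: nine ordinary readings (illustration only).
values = [10, 12, 11, 13, 12, 11, 10, 12, 13]

# The same readings with one extreme value appended.
with_outlier = values + [500]

# The mean is pulled strongly toward the extreme value ...
mean_clean = statistics.mean(values)
mean_outlier = statistics.mean(with_outlier)

# ... while the median is unchanged for this sample.
median_clean = statistics.median(values)
median_outlier = statistics.median(with_outlier)

print("mean without / with outlier:  ", mean_clean, mean_outlier)
print("median without / with outlier:", median_clean, median_outlier)
```

Any estimator that relies on the mean or the variance of a column inherits this sensitivity, which is why outliers are usually handled during data preparation.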
There are multiple outlier detection methods available. A few of them are listed below.
The outlier detection methods in rubiscape are listed below.
Any value below the 5th percentile or above the 95th percentile of the dataset is considered an outlier.
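The percentile rule above can be sketched in a few lines of Python. This is a minimal illustration, not rubiscape's implementation: the function name `find_percentile_outliers` is hypothetical, and the choice of the "inclusive" interpolation method in `statistics.quantiles` is an assumption, since the exact percentile definition used by the product is not stated here.

```python
import statistics

def find_percentile_outliers(values, lower_pct=5, upper_pct=95):
    """Return the values that fall outside the given percentile band.

    Hypothetical helper for illustration; percentiles must be integers
    between 1 and 99.
    """
    # quantiles(n=100) returns 99 cut points; cuts[i - 1] is the i-th percentile.
    # The "inclusive" method is an assumption about how percentiles are defined.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [v for v in values if v < low or v > high]

# Hypothetical sample: the integers 1 to 100.
sample = list(range(1, 101))
outliers = find_percentile_outliers(sample)
print(outliers)
```

With this rule, roughly the lowest and highest 5% of any continuous column are flagged, whether or not they are genuinely anomalous, so the flagged values are worth reviewing before they are corrected.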
The outlier correction methods in rubiscape either remove the extreme values or impute them with the mean, median, or mode.
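As a rough sketch of the correction step, the example below removes outliers or imputes them with the mean, median, or mode, matching the output described for this feature. The function `correct_outliers` and its `strategy` parameter are hypothetical names. Two further assumptions are made: outliers are identified with the 5th and 95th percentile rule, and the replacement value is computed from the non-outlier values only.

```python
import statistics

def correct_outliers(values, strategy="median"):
    """Remove or impute values outside the 5th-95th percentile band.

    Hypothetical helper; strategy is "remove", "mean", "median", or "mode".
    """
    # cuts[4] is the 5th percentile and cuts[94] the 95th ("inclusive" is assumed).
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[4], cuts[94]

    inliers = [v for v in values if low <= v <= high]
    if strategy == "remove":
        return inliers

    # Assumption: the fill value is computed from the non-outlier values.
    stat_funcs = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    fill = stat_funcs[strategy](inliers)
    return [v if low <= v <= high else fill for v in values]

# Hypothetical sample with one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 500]
removed = correct_outliers(data, strategy="remove")
imputed = correct_outliers(data, strategy="median")
print(removed)
print(imputed)
```

Removal shortens the dataset, while imputation keeps the row count intact, which matters when the column must stay aligned with other columns.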