Outlier Detection


Description

Outlier Detection reveals the extreme values that deviate from the rest of the data in a real-world dataset.

Why to use

Numerical Analysis – Data Preparation 

When to use

When there are certain values in the data which significantly deviate from the rest of the data. 

When not to use

On textual data.
When there are no outliers in the data.

Prerequisites

It should be used on numerical data. 

Input

Dataset with extreme values. 

Output

Dataset with extreme values either removed or imputed with mean, median, or mode.

Statistical Methods used

  • Outside of 1.5 IQR Rule
  • Outside of 5th and 95th Percentile Range
  • Outside of 2nd and 98th Percentile Range
  • 3 Standard Deviations from the Mean
  • Mean
  • Median
  • Mode 

Limitations

-

An outlier is a data value that is unlike the rest of the data. It is rare, or distinct, and does not fit in with the rest of the data.
There are many ways data can end up with outliers. For example,

  • In consumer data for an e-commerce site, a few customers may buy products in very large quantities.
  • In mortality data, very few people live beyond 100 years of age.

Many algorithms, including estimators in libraries such as scikit-learn, can give misleading results if outliers are present in the data, because they assume that all values fall within a particular range. It is therefore recommended to use the Outlier Detection method to identify these rare and extreme values. Detecting and correcting outliers helps to produce a more uniform dataset.
Multiple outlier detection methods are available. A few of them are listed below.

  • Standard Deviation Method
  • Interquartile Range Method
  • Automatic Outlier Detection

Outlier Detection Methods in rubiscape

The outlier detection methods in rubiscape are listed below.

  • Outside of 1.5 IQR Rule – Any value that lies more than 1.5 times the interquartile range (IQR, the difference between the third and first quartiles) above the third quartile or below the first quartile is considered an outlier.
  • Outside of 5th and 95th Percentile Range – Any value below the 5th percentile or above the 95th percentile of the dataset is considered an outlier.
  • Outside of 2nd and 98th Percentile Range – Any value below the 2nd percentile or above the 98th percentile of the dataset is considered an outlier.
  • 3 Standard Deviations from Mean – Any value that lies more than 3 standard deviations away from the mean is considered an outlier.
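
The four rules above can be sketched in a few lines of Python with NumPy. This is an illustration only, not rubiscape's implementation: the function names and sample data are hypothetical, and the choice of NumPy's default (linear) percentile interpolation and the population standard deviation are assumptions, since the product's exact conventions are not stated here.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])  # assumption: linear interpolation
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def percentile_outliers(values, lower=5, upper=95):
    """Flag values below the lower or above the upper percentile."""
    x = np.asarray(values, dtype=float)
    lo, hi = np.percentile(x, [lower, upper])
    return (x < lo) | (x > hi)

def std_outliers(values, n_std=3.0):
    """Flag values more than n_std standard deviations from the mean."""
    x = np.asarray(values, dtype=float)
    # assumption: population standard deviation (ddof=0)
    return np.abs(x - x.mean()) > n_std * x.std()

# Hypothetical sample: eleven typical values and one extreme value.
data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95], dtype=float)

print(data[iqr_outliers(data)])                # 1.5 IQR rule
print(data[percentile_outliers(data, 5, 95)])  # 5th and 95th percentile range
print(data[percentile_outliers(data, 2, 98)])  # 2nd and 98th percentile range
print(data[std_outliers(data)])                # 3 standard deviations
```

Each function returns a Boolean mask, so the flagged values can be inspected before deciding how to correct them. Note that the percentile rules use fixed cut-offs, so on a large dataset with many distinct values they flag roughly the same share of rows whether or not those rows are truly extreme.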

Outlier Correction Methods in rubiscape

The outlier correction methods in rubiscape are listed below.

  • Replace by mean – Replaces the outlier values with the mean of the in-range values.
  • Replace by median – Replaces the outlier values with the median of the in-range values.
  • Replace by mode – Replaces the outlier values with the mode of the in-range values.
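
The correction step can be sketched as follows. This is a minimal example under stated assumptions, not rubiscape's implementation: `replace_outliers` and the sample data are hypothetical, the outlier mask is built by hand as a stand-in for any detection rule, and ties for the mode are resolved by taking the smallest value.

```python
import numpy as np

def replace_outliers(values, is_outlier, strategy="median"):
    """Replace flagged values with a statistic of the in-range values."""
    x = np.asarray(values, dtype=float)
    mask = np.asarray(is_outlier, dtype=bool)
    in_range = x[~mask]  # statistics are computed without the outliers
    if strategy == "mean":
        fill = in_range.mean()
    elif strategy == "median":
        fill = np.median(in_range)
    elif strategy == "mode":
        uniq, counts = np.unique(in_range, return_counts=True)
        fill = uniq[np.argmax(counts)]  # assumption: smallest value wins ties
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    corrected = x.copy()
    corrected[mask] = fill
    return corrected

# Hypothetical sample: eleven typical values and one extreme value.
data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95], dtype=float)
mask = data > 50  # stand-in for the result of a detection rule

for strategy in ("mean", "median", "mode"):
    print(strategy, replace_outliers(data, mask, strategy)[-1])
```

Computing the statistic from the in-range values only matters most for the mean: including the extreme value in the average would pull the replacement toward the outlier it is meant to correct.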