Outliers

1. Definition of Outlier

Definition: An outlier is a data point that deviates markedly from the other observations in a dataset. Such points are statistically unusual; they may stem from errors or reflect genuine but rare patterns.

Univariate Outliers:
Definition: Data points that deviate significantly from other observations within a single variable.
Multivariate Outliers:
Definition: Data points that are unusual in the joint relationship between several variables in a multidimensional space, even if each individual value may look ordinary on its own.

Statistical Methods:
- Z-Score: If data is normally distributed, the Z-score can be used to identify outliers. Points with a Z-score greater than 3 or less than -3 are typically considered outliers.
- Formula: \[ Z = \frac{X - \mu}{\sigma} \] where \( X \) is the data point, \( \mu \) is the mean, and \( \sigma \) is the standard deviation.
- IQR (Interquartile Range): The IQR method uses the interquartile range to identify outliers. Data points below \( Q_1 - 1.5 \times \text{IQR} \) or above \( Q_3 + 1.5 \times \text{IQR} \) are considered outliers.
- Formula: \[ \text{IQR} = Q_3 - Q_1 \] Outliers are the points that fall outside the interval \( [Q_1 - 1.5 \times \text{IQR},\ Q_3 + 1.5 \times \text{IQR}] \).
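
The two statistical rules above can be sketched with Python's standard library alone. This is a minimal illustration, not a definitive implementation: the sample `data`, the helper names `zscore_outliers` and `iqr_outliers`, the use of the population standard deviation for \( \sigma \), and the "inclusive" quartile method are all assumptions made for the example.

```python
import statistics

# Hypothetical sample: 20 readings near 50 plus one extreme value.
data = [48, 49, 50, 51, 52, 47, 53, 50, 49, 51,
        50, 52, 48, 49, 51, 50, 47, 53, 50, 50, 120]


def zscore_outliers(values, threshold=3.0):
    """Return the values whose |Z| exceeds the threshold."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    return [x for x in values if abs((x - mu) / sigma) > threshold]


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates between sorted data points (assumption);
    # other quartile conventions can give slightly different bounds.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]


print("Z-score outliers:", zscore_outliers(data))
print("IQR outliers:", iqr_outliers(data))
```

One design note: with very small samples the Z-score rule can fail to flag anything, because a single extreme value inflates the mean and standard deviation it is measured against. The IQR rule is less affected, since quartiles depend on ranks rather than magnitudes.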
Visualization Methods:
- Boxplot: A boxplot is a common tool for identifying univariate outliers, graphically displaying the quartiles of the data and highlighting potential outliers.
- Scatter Plot: A scatter plot can be used to identify multivariate outliers, especially when outliers appear in relation to multiple dimensions simultaneously.
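
As a rough sketch of the boxplot approach, the snippet below draws a boxplot with matplotlib and reads back the points it marks as fliers. The sample `values` are made up for illustration, and the `Agg` backend is chosen only so the example runs without a display; matplotlib's default whisker length of 1.5 times the IQR is what determines which points are drawn as fliers here.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt

# Hypothetical sample: 20 readings near 50 plus one extreme value.
values = [48, 49, 50, 51, 52, 47, 53, 50, 49, 51,
          50, 52, 48, 49, 51, 50, 47, 53, 50, 50, 120]

fig, ax = plt.subplots()
box = ax.boxplot(values)  # default whiskers extend 1.5 * IQR past the quartiles
ax.set_ylabel("value")

# Points beyond the whiskers are drawn individually as "fliers".
fliers = box["fliers"][0].get_ydata()
print("Points drawn as fliers:", list(fliers))
```

Calling `fig.savefig(...)` with a path of your choice would write the figure to disk for inspection.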
Machine Learning Methods:
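- Isolation Forest: A tree-based algorithm that isolates points using random splits; outliers tend to be isolated in fewer splits than typical points. It is well suited to high-dimensional data.
- LOF (Local Outlier Factor): Identifies outliers by comparing the local density of each data point with that of its nearest neighbors; points lying in much sparser regions than their neighbors are flagged.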
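
Both methods are available in scikit-learn, and a minimal sketch might look like the following. The synthetic data, the `contamination=0.01` setting, and `n_neighbors=20` are assumptions chosen for illustration; in practice these parameters need tuning for the dataset at hand.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Synthetic 2-D data (assumption): 100 points near the origin plus one far away.
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
X = np.vstack([inliers, [[10.0, 10.0]]])

# fit_predict returns 1 for inliers and -1 for outliers in both estimators.
iso = IsolationForest(contamination=0.01, random_state=0)
iso_labels = iso.fit_predict(X)

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
lof_labels = lof.fit_predict(X)

print("Isolation Forest flags rows:", np.where(iso_labels == -1)[0])
print("LOF flags rows:", np.where(lof_labels == -1)[0])
```

The `contamination` parameter is the expected fraction of outliers, so it directly controls how many points are flagged; setting it too high will label ordinary points as outliers.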

Removing Outliers:
- Definition: If outliers are confirmed to be due to errors or they don't add value to the analysis, they can be removed from the dataset.
Data Transformation:
- Definition: Data transformations (e.g., log transformation or square root transformation) can be applied to reduce the impact of outliers.
Replacing with Median or Mean:
- Definition: In some cases, outliers can be replaced with the median or mean of the data. The median is usually the safer choice, because the mean is itself pulled toward the outliers.
Binning:
- Definition: Binning the data so that outliers are grouped into special bins can reduce their impact on the model.
Robust Algorithms:
- Definition: Using algorithms that are less sensitive to outliers can minimize their effect. Tree-based models (e.g., decision trees, random forests) split on thresholds rather than distances, so extreme feature values influence them less.
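
Several of the handling strategies above (removal, transformation, replacement with the median, and binning) can be sketched with the standard library. The sample `data`, the choice of the IQR rule to decide which points count as outliers, and the bin labels are all assumptions made for the example.

```python
import math
import statistics

# Hypothetical sample: 20 readings near 50 plus one extreme value.
data = [48, 49, 50, 51, 52, 47, 53, 50, 49, 51,
        50, 52, 48, 49, 51, 50, 47, 53, 50, 50, 120]

# Outlier bounds from the IQR rule (assumption: "inclusive" quartile method).
q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_outlier(x):
    return x < lower or x > upper


# 1. Removal: drop the flagged points.
removed = [x for x in data if not is_outlier(x)]

# 2. Transformation: a log transform compresses large values
#    (only valid for strictly positive data).
logged = [math.log(x) for x in data]

# 3. Replacement: substitute the median for each flagged point.
median = statistics.median(data)
replaced = [median if is_outlier(x) else x for x in data]

# 4. Binning: group extreme values into their own bins.
binned = ["low" if x < lower else "high" if x > upper else "typical"
          for x in data]

print("After removal:", len(removed), "points remain")
print("Range before/after log:", max(data) - min(data),
      round(max(logged) - min(logged), 3))
print("Last value after replacement:", replaced[-1])
print("Bin of last value:", binned[-1])
```

Which strategy is appropriate depends on why the point is extreme, so none of these should be applied mechanically.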

Retaining Meaningful Outliers:
- Definition: In some cases, outliers may represent important patterns or anomalies and should not be removed without careful consideration.
Multiple Verification:
- Definition: Outliers should be checked with more than one method (for example, a statistical rule and a visualization) before removal or modification, to confirm that they are genuinely anomalous.