Outliers: A data object that deviates significantly from the normal objects as if it were generated by a different mechanism.
Outliers are different from the noise data.
- Noise is random error or variance in a measured variable
- Noise should be removed before outlier detection
Outliers are interesting: It violates the mechanism that generates the normal data
- Outlier detection vs. novelty detection (identify new topics and trends in a timely manner in social media): early stage, outlier; but later merged into the model
- There applications are :
- Credit card fraud detection
- Telecom fraud detection
- Customer segmentation
- Medical analysis
Table of Contents
Importance of Anomaly Detection
Ozone Depletion History
- In 1977 three researchers (Farman, Gardinar and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
- Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations? The researchers held back publishing their work for nearly a decade.
- The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!
Anomaly Detection
Challenges
- How many outliers are there in the data?
- Method is unsupervised
- Validation can be quite challenging (just like for clustering)
- Finding needle in a haystack
Working assumption:
- There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data
Types of Outliers
Three kinds: global, contextual and collective outliers
- Global outlier (or point anomaly)
- Object is Og if it significantly deviates from the rest of the data set
- Ex. Intrusion detection in computer networks
- Issue: Find an appropriate measurement of deviation
- Contextual outlier (or conditional outlier)
- Object is Oc if it deviates significantly based on a selected context
- Ex. 80o F in Urbana: outlier? (depending on summer or winter?)
- Attributes of data objects should be divided into two groups
- Contextual attributes: defines the context, e.g., time & location
- Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature
- Can be viewed as a generalization of local outliers—whose density significantly deviates from its local area
- Issue: How to define or formulate meaningful context?
- Collective Outliers
- A subset of data objects collectively deviate significantly from the whole data set, even if the individual data objects may not be outliers
- Applications: E.g., intrusion detection:
- When a number of computers keep sending denial-of-service packages to each other
- Detection of collective outliers
- Consider not only behavior of individual objects, but also that of groups of objects
- Need to have the background knowledge on the relationship among data objects, such as a distance or similarity measure on objects.
- A data set may have multiple types of outlier
- One object may belong to more than one type of outlier
Outlier Detection I: Supervised Methods
Two ways to categorize outlier detection methods:
- Based on whether user-labeled examples of outliers can be obtained:
- Supervised, semi-supervised vs. unsupervised methods
- Based on assumptions about normal data and outliers:
- Statistical, proximity-based, and clustering-based methods
- Outlier Detection I: Supervised Methods
- Modeling outlier detection as a classification problem
- Samples examined by domain experts used for training & testing
- Methods for Learning a classifier for outlier detection effectively:
- Model normal objects & report those not matching the model as outliers, or
- Model outliers and treat those not matching the model as normal
- Challenges
- Imbalanced classes, i.e., outliers are rare: Boost the outlier class and make up some artificial outliers
- Catch as many outliers as possible, i.e., recall is more important than accuracy (i.e., not mislabeling normal objects as outliers)
Outlier Detection II: Unsupervised Methods
- Assume the normal objects are somewhat “clustered’‘ into multiple groups, each having some distinct features
- An outlier is expected to be far away from any groups of normal objects
- Weakness: Cannot detect collective outlier effectively
- Normal objects may not share any strong patterns, but the collective outliers may share high similarity in a small area
- Ex. In some intrusion or virus detection, normal activities are diverse
- Unsupervised methods may have a high false positive rate but still miss many real outliers.
- Supervised methods can be more effective, e.g., identify attacking some key resources
- Many clustering methods can be adapted for unsupervised methods
- Find clusters, then outliers: not belonging to any cluster
- Problem 1: Hard to distinguish noise from outliers
- Problem 2: Costly since first clustering: but far less outliers than normal objects
- Newer methods: tackle outliers directly
Outlier Detection III: Semi-Supervised Methods
- Situation: In many applications, the number of labeled data is often small: Labels could be on outliers only, normal objects only, or both
- Semi-supervised outlier detection: Regarded as applications of semi-supervised learning
- If some labeled normal objects are available
- Use the labeled examples and the proximate unlabeled objects to train a model for normal objects
- Those not fitting the model of normal objects are detected as outliers
- If only some labeled outliers are available, a small number of labeled outliers many not cover the possible outliers well
- To improve the quality of outlier detection, one can get help from models for normal objects learned from unsupervised methods
Mining Contextual Outliers:
Transform into Conventional Outlier Detection
If the contexts can be clearly identified, transform it to conventional outlier detection
- Identify the context of the object using the contextual attributes
- Calculate the outlier score for the object in the context using a conventional outlier detection method
- Ex. Detect outlier customers in the context of customer groups
- Contextual attributes: age group, postal code
- Behavioral attributes: # of trans/yr, annual total trans. amount
Steps:
(1) locate c’s context,
(2) compare c with the other customers in the same group, and
(3) use a conventional outlier detection method
Challenges of Outlier Detection
- Modeling normal objects and outliers properly
- Hard to enumerate all possible normal behaviors in an application
- The border between normal and outlier objects is often a gray area
- Application-specific outlier detection
- Choice of distance measure among objects and the model of relationship among objects are often application-dependent
- E.g., clinic data: a small deviation could be an outlier; while in marketing analysis, larger fluctuations
- Handling noise in outlier detection
- Noise may distort the normal objects and blur the distinction between normal objects and outliers. It may help hide outliers and reduce the effectiveness of outlier detection
- Understandability
- Understand why these are outliers: Justification of the detection
- Specify the degree of an outlier: the unlikelihood of the object being generated by a normal mechanism