What is Cluster Analysis ? Type of data in clustering analysis

Cluster Analysis : Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Cluster Analysis
Cluster Analysis



Examples of Clustering Applications

  • Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
  • Land use: Identification of areas of similar land use in an earth observation database
  • Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
  • City-planning: Identifying groups of houses according to their house type, value, and geographical location
  • Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults

What is not Cluster Analysis?

  • Supervised classification
    • Have class label information
  • Simple segmentation
    • Dividing students into different registration groups alphabetically, by last name
  • Results of a query
    • Groupings are a result of an external specification

What Is Good Clustering?

  • A good clustering method will produce high quality clusters with
    • high intra-class similarity
    • low inter-class similarity
  • The quality of a clustering result depends on both the similarity measure used by the method and its implementation
  • The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

Measure the Quality of Clustering

  • Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, typically metric: d(i, j)
  • There is a separate “quality” function that measures the “goodness” of a cluster.
  • The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal ratio, and vector variables.
  • Weights should be associated with different variables based on applications and data semantics.
  • It is hard to define “similar enough” or “good enough”
    • the answer is typically highly subjective.



Requirements of Clustering in Data Mining

  • Scalability
  • Ability to deal with different types of attributes
  • Discovery of clusters with arbitrary shape
  • Minimal requirements for domain knowledge to determine input parameters
  • Able to deal with noise and outliers
  • Insensitive to order of input records
  • High dimensionality
  • Incorporation of user-specified constraints
  • Interpretability and usability



Type of data in clustering analysis

  • Interval-scaled variables
  • Binary variables
  • Nominal, ordinal, and ratio variables
  • Variables of mixed types

Interval-valued variables

  • Standardize data
    • Calculate the mean absolute deviation:
Interval-valued variables
Interval-valued variables
  • where
Interval-valued variables
Interval-valued variables
  • Calculate the standardized measurement (z-score)
z-score
z-score
  • Using mean absolute deviation is more robust than using standard deviation

Similarity and Dissimilarity Between Objects

  • Distances are normally used to measure the similarity or dissimilarity between two data objects
  • Some popular ones include: Minkowski distance:
Object
Object

where  i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer

  • If q = 1, d is Manhattan distance
Manhattan distance
Manhattan distance

Binary Variables

  • A contingency table for binary data
contingency table
contingency table
  • Distance measure for symmetric binary variables:
symmetric binary variables
symmetric binary variables
  • Distance measure for asymmetric binary variables:
Distance measure for asymmetric binary variables
Distance measure for asymmetric binary variables
    • Jaccard coefficient (similarity measure for asymmetric binary variables):
Jaccard coefficient
Jaccard coefficient

Nominal Variables

  • A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
  • Method 1: Simple matching
  • m: # of matches, p: total # of variables
Nominal Variables
Nominal Variables
  • Method 2: use a large number of binary variables
  • creating a new binary variable for each of the M nominal states

Ordinal Variables

  • An ordinal variable can be discrete or continuous
  • Order is important, e.g., rank
  • Can be treated like interval-scaled
  • replace xif by their rank
  • map the range of each variable onto [0, 1] by replacing i-th object in the f-th variable by
Ordinal Variables
Ordinal Variables
  • compute the dissimilarity using methods for interval-scaled variables

Notion of a Cluster can be Ambiguous

Notion of a Cluster can be Ambiguous
Notion of a Cluster can be Ambiguous



Types of Clusterings

  • A clustering is a set of clusters
  • An important distinction among types of clusterings : hierarchical and partitional sets of clusters
  • Partitional Clustering
    • A division data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
  • Hierarchical clustering
    • A set of nested clusters organized as a hierarchical tree

Partitional Clustering

Partitional Clustering
Partitional Clustering

Hierarchical Clustering

Hierarchical Clustering
Hierarchical Clustering

Other Distinctions Between Sets of Clusters

  • Exclusive versus non-exclusive
    • In non-exclusive clusterings, points may belong to multiple clusters.
    • Can represent multiple classes or ‘border’ points
  • Fuzzy versus non-fuzzy
    • In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
    • Weights must sum to 1
    • Probabilistic clustering has similar characteristics
  • Partial versus complete
    • In some cases, we only want to cluster some of the data
  • Heterogeneous versus homogeneous
    • Cluster of widely different sizes, shapes, and densities

Types of Clusters

Clusters can be of many types:

  • Well-separated clusters
  • Center-based clusters
  • Contiguous clusters
  • Density-based clusters

Types of Clusters: Well-Separated

A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
Well-Separated Clusters

Types of Clusters: Center-Based

  • A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of a cluster, than to the center of any other cluster
  • The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster
Center-Based
4 Center-Based

Types of Clusters: Contiguity-Based

  • A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
Contiguous Cluster
8 Contiguous Cluster

Types of Clusters: Density-Based

  • A cluster is a dense region of points, which is separated by low-density regions, from other regions of high density.
  • Used when the clusters are irregular or intertwined, and when noise and outliers are present.
6 density-based clusters
6 density-based clusters

Types of Clusters: Conceptual Clusters

Shared Property or Conceptual Clusters

  • Finds clusters that share some common property or represent a particular concept.
2 Overlapping Circle
2 Overlapping Circle

Types of Clusters: Objective Function

Clusters Defined by an Objective Function

  • Finds clusters that minimize or maximize an objective function.
  • Enumerate all possible ways of dividing the points into clusters and evaluate the `goodness’ of each potential set of clusters by using the given objective function. (NP Hard)
  • Can have global or local objectives.
    • Hierarchical clustering algorithms typically have local objectives
    • Partitional algorithms typically have global objectives
  • A variation of the global objective function approach is to fit the data to a parameterized model.
    • Parameters for the model are determined from the data.
    • Mixture models assume that the data is a ‘mixture’ of a number of statistical distributions.
  • Map the clustering problem to a different domain and solve a related problem in that domain
    • Proximity matrix defines a weighted graph, where the nodes are the points being clustered, and the weighted edges represent the proximities between points
    • Clustering is equivalent to breaking the graph into connected components, one for each cluster.
    • Want to minimize the edge weight between clusters and maximize the edge weight within clusters

Important Characteristics of the Input Data

  • Type of proximity or density measure
    • This is a derived measure, but central to clustering
  • Attribute type
    • Dictates type of similarity
  • Type of Data
    • Dictates type of similarity
    • Other characteristics, e.g., autocorrelation
  • Dimensionality
  • Noise and Outliers
  • Type of Distribution