Table of Contents

Graph Mining

Graph Mining is a

Methods for Mining Frequent Subgraphs
Mining Variant and Constrained Substructure Patterns
Applications of Graph Mining are :
- Graph Indexing
- Similarity Search
- Classification and Clustering

Why Graph Mining ?

Graphs are ubiquitous
- Chemical compounds (Cheminformatics).
- Protein structures, biological pathways/networks (Bioinformactics).
- Program control flow, traffic flow, and workflow analysis.
- XML databases, Web, and social network analysis.
Graph is a general model
- Trees, lattices, sequences, and items are degenerated graphs.
Diversity of graphs
- Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D).
Complexity of algorithms: many problems are of high complexity.

Graph Pattern Mining

Frequent subgraphs
A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
Applications of graph pattern mining

Mining biochemical structures
Program control flow analysis
Mining XML structures or Web communities

Graph Pattern Explosion Problem

If a graph is frequent, all of its subgraphs are frequent ─ the Apriori property
An n-edge frequent graph may have 2n subgraphs
Among 422 chemical compounds which are confirmed to be active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns if the minimum support is 5%

Closed Frequent Graphs

Motivation: Handling graph pattern explosion problem
Closed frequent graph
- A frequent graph G is closed if there exists no supergraph of G that carries the same support as G
If some of G’s subgraphs have the same support, it is unnecessary to output these subgraphs (i.e. non-closed graphs)

Graph Clustering

Graph similarity measure

Feature-based similarity measure
- Each graph is represented as a feature vector
- The similarity is defined by the distance of their corresponding vectors
- Frequent subgraphs can be used as features
Structure-based similarity measure
- Maximal common subgraph
- Graph edit distance: insertion, deletion, and relabel
- Graph alignment distance

Graph Classification

Local structure based approach

Local structures in a graph, e.g., neighbors surrounding a vertex, paths with fixed length

Graph pattern-based approach

Subgraph patterns from domain knowledge
Subgraph patterns from data mining

Graph Pattern-Based Classification

Subgraph patterns from domain knowledge
- Molecular descriptors
Subgraph patterns from data mining
General idea
- Each graph is represented as a feature vector x = {x1, x2, …, xn}, where xi is the frequency of the i-th pattern in that graph
- Each vector is associated with a class label
- Classify these vectors in a vector space

Graph Mining Challenges

High computational cost
- Need to tradeoff between efficiency/scalability and quality
Sophisticated graphs
- May involve weights and/or cycles.
High dimensionality
- A graph can have many vertices. In a similarity matrix, a vertex is represented as a vector (a row in the matrix) whose dimensionality is the number of vertices in the graph
Sparsity
- A large graph is often sparse, meaning each vertex on average connects to only a small number of other vertices
- A similarity matrix from a large sparse graph can also be sparse

What is Graph Mining ? Graph Mining Challenges

Graph Mining

Why Graph Mining ?