Table of Contents
Graph Mining
Graph Mining is a
- Methods for Mining Frequent Subgraphs
- Mining Variant and Constrained Substructure Patterns
- Applications of Graph Mining are :
- Graph Indexing
- Similarity Search
- Classification and Clustering
Why Graph Mining ?
- Graphs are ubiquitous
- Chemical compounds (Cheminformatics).
- Protein structures, biological pathways/networks (Bioinformactics).
- Program control flow, traffic flow, and workflow analysis.
- XML databases, Web, and social network analysis.
- Graph is a general model
- Trees, lattices, sequences, and items are degenerated graphs.
- Diversity of graphs
- Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D).
- Complexity of algorithms: many problems are of high complexity.
Graph Pattern Mining
Frequent subgraphs
A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
Applications of graph pattern mining
- Mining biochemical structures
- Program control flow analysis
- Mining XML structures or Web communities
Graph Pattern Explosion Problem
- If a graph is frequent, all of its subgraphs are frequent ─ the Apriori property
- An n-edge frequent graph may have 2n subgraphs
- Among 422 chemical compounds which are confirmed to be active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns if the minimum support is 5%
Closed Frequent Graphs
- Motivation: Handling graph pattern explosion problem
- Closed frequent graph
- A frequent graph G is closed if there exists no supergraph of G that carries the same support as G
- If some of G’s subgraphs have the same support, it is unnecessary to output these subgraphs (i.e. non-closed graphs)
Graph Clustering
Graph similarity measure
- Feature-based similarity measure
- Each graph is represented as a feature vector
- The similarity is defined by the distance of their corresponding vectors
- Frequent subgraphs can be used as features
- Structure-based similarity measure
- Maximal common subgraph
- Graph edit distance: insertion, deletion, and relabel
- Graph alignment distance
Graph Classification
Local structure based approach
- Local structures in a graph, e.g., neighbors surrounding a vertex, paths with fixed length
Graph pattern-based approach
- Subgraph patterns from domain knowledge
- Subgraph patterns from data mining
Graph Pattern-Based Classification
- Subgraph patterns from domain knowledge
- Molecular descriptors
- Subgraph patterns from data mining
- General idea
- Each graph is represented as a feature vector x = {x1, x2, …, xn}, where xi is the frequency of the i-th pattern in that graph
- Each vector is associated with a class label
- Classify these vectors in a vector space
Graph Mining Challenges
- High computational cost
- Need to tradeoff between efficiency/scalability and quality
- Sophisticated graphs
- May involve weights and/or cycles.
- High dimensionality
- A graph can have many vertices. In a similarity matrix, a vertex is represented as a vector (a row in the matrix) whose dimensionality is the number of vertices in the graph
- Sparsity
- A large graph is often sparse, meaning each vertex on average connects to only a small number of other vertices
- A similarity matrix from a large sparse graph can also be sparse