What is Graph Mining ? Graph Mining Challenges

Graph Mining

Graph Mining is a

  • Methods for Mining Frequent Subgraphs
  • Mining Variant and Constrained Substructure Patterns
  • Applications of Graph Mining are :
    • Graph Indexing
    • Similarity Search
    • Classification and Clustering



Why Graph Mining ?

  • Graphs are ubiquitous
    • Chemical compounds (Cheminformatics).
    • Protein structures, biological pathways/networks (Bioinformactics).
    • Program control flow, traffic flow, and workflow analysis.
    • XML databases, Web, and social network analysis.
  • Graph is a general model
    • Trees, lattices, sequences, and items are degenerated graphs.
  • Diversity of graphs
    • Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D).
  • Complexity of algorithms: many problems are of high complexity.

 


Graph Pattern Mining

Frequent subgraphs
A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
Applications of graph pattern mining

  • Mining biochemical structures
  • Program control flow analysis
  • Mining XML structures or Web communities

Graph Pattern Explosion Problem

  • If a graph is frequent, all of its subgraphs are frequent ─ the Apriori property
  • An n-edge frequent graph may have 2n subgraphs
  • Among 422 chemical compounds which are confirmed to be active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns if the minimum support is 5%

Closed Frequent Graphs

  • Motivation: Handling graph pattern explosion problem
  • Closed frequent graph
    • A frequent graph G is closed if there exists no supergraph of G that carries the same support as G
  • If some of G’s subgraphs have the same support, it is unnecessary to output these subgraphs (i.e. non-closed graphs)

Graph Clustering

Graph similarity measure

  • Feature-based similarity measure
    • Each graph is represented as a feature vector
    • The similarity is defined by the distance of their corresponding vectors
    • Frequent subgraphs can be used as features
  • Structure-based similarity measure
    • Maximal common subgraph
    • Graph edit distance: insertion, deletion, and relabel
    • Graph alignment distance

Graph Classification

Local structure based approach

  • Local structures in a graph, e.g., neighbors surrounding a vertex, paths with fixed length

Graph pattern-based approach

  • Subgraph patterns from domain knowledge
  • Subgraph patterns from data mining

Graph Pattern-Based Classification

  • Subgraph patterns from domain knowledge
    • Molecular descriptors
  • Subgraph patterns from data mining
  • General idea
    • Each graph is represented as a feature vector x = {x1, x2, …, xn}, where xi is the frequency of the i-th pattern in that graph
    • Each vector is associated with a class label
    • Classify these vectors in a vector space



Graph Mining Challenges

  • High computational cost
    • Need to tradeoff between efficiency/scalability and quality
  • Sophisticated graphs
    • May involve weights and/or cycles.
  • High dimensionality
    • A graph can have many vertices. In a similarity matrix, a vertex is represented as a vector (a row in the matrix) whose dimensionality is the number of vertices in the graph
  • Sparsity
    • A large graph is often sparse, meaning each vertex on average connects to only a small number of other vertices
    • A similarity matrix from a large sparse graph can also be sparse