Investigation Of Techniques To Increase The Scalability Of Graph-based Data Mining Algorithms.

Date

2007-08-23T01:56:09Z

Authors

Journal Title

Journal ISSN

Volume Title

Publisher

Computer Science & Engineering

Abstract

Frequent subgraph pattern recognition and graph-based relational learning have been an emerging area of data mining research with scientific and commercial applications. At the kernel of these algorithms are the computationally-expensive graph and subgraph isomorphism tests.
The graph isomorphism problem consists in deciding whether two graphs are isomorphic i.e., whether there is a one-one mapping between the vertices of the two graphs that respects the edge connections. Many graphs will be depicted quite differently but in actuality have the same inherent structure. This leads to the isomorphism problem. The graph isomorphism problem belongs to the class of NP problems and has been conjectured intractable though probably not NP-complete. We hypothesize that approximation algorithms can be developed for the graph and subgraph isomorphism problems, and that these algorithms can improve the runtime of data mining systems that rely on these capabilities. We analyze the validity of our hypothesis by implementing and testing three approaches to the problem: a genetic algorithm for subgraph isomorphism detection, canonical labeling of graphs for graph isomorphism testing and a technique that reduces the need for isomorphism tests in SUBDUE. Canonical labeling is a technique that assigns a unique code to a graph that is invariant on the order of the vertices and the edges in the graph. As a result two graphs will have same canonical labels if they are isomorphic and vice versa. In cases where many isomorphism checks are required between same set of graphs, a better way of performing this task is to assign each graph a canonical label. Our research has considered canonical labeling technique in SUBDUE to reduce the number of calls to the graphMatch routine and also as an alternative to the graphMatch routine. The subgraph isomorphism problem consists in deciding whether a graph is isomorphic to a subgraph of another graph. Subgraph isomorphism belongs to the class of NP-complete problems. Genetic algorithms represent an approximation technique that runs for a certain number of generations, retaining the best chromosomes from the current generation to the next generation and producing new chromosomes using the genetic operators of selection, crossover and mutation. This approach is inspired by the natural process of evolution. In our research, a genetic algorithm approach has been considered for subgraph isomorphism detection in SUBDUE to find the instances of the predefined substructures in the input graphs. Since it is an approximation technique, it is not guaranteed to find all the instances of the subgraph in the main graph. Finally an approach taken by Potts is analyzed that reduces the number of calls to graphMatch by changing the order in which the instances are extended in SUBDUE.

Description

Keywords

Citation