|
Abstract:
|
Analysis of large collections of data has become inescapable in many areas of scientific and commercial endeavor . As the size and dimensionality of these collections exceed the pattern recognition capability of the human mind computational analysis tools become a necessity for interpretation . Clustering
algorithms , which aim to find interesting groupings within collections of data , are one such tool . Each algorithm incorporates into its design an inherent definition of “interesting” intended to capture nonrandom data groupings likely to have some interpretation to human users . Most existing algorithms include
as part of their definition of “interesting” an assumption that each data point can belong at most to one grouping . While this assumption allows for algorithmic convenience and ease of analysis , it is often an artificial imposition on true underlying data structure . The idea of allowing points to belong to multiple groupings - known as “overlapping” or “multiple membership” clustering - has emerged in several domains in ad hoc solutions lacking conceptual unity
in approach , interpretation , and analysis . This dissertation proposes general , domain -independent elucidations and practical techniques which address each of these .
We begin by positing overlapping clustering’s role specifically , and clustering’s role in general , as assistive technologic tools allowing human minds to represent and interpret structures in data beyond the capability of our innate senses . With this guiding purpose clarified , we provide a catalog of existing techniques . We then address the issue of objectively comparing the results
of different algorithms , specifically examining the previously defined Omega
index , as well as multiple membership generalizations of normalized mutual information . Following that comparison , we propose a novel approach to com -
paring clusterings called cluster alignment . By combining a sorting algorithm with a greedy matching algorithm , we produce comparably organized membership matrices and a means for both numerically and visually comparing multiple -membership assignments . With overlapping clustering’s purpose defined , and the means to analyze results , we move on to presenting algorithms for efficiently discovering overlapping clusters in data . First , we present a generalization of one of the common themes in the ad hoc approaches : additive
clustering . Starting with a previously developed structural model of additive clustering , we generalize it to be applicable to any regular exponential
family distribution thereby extending its utility into several domains , notably high -dimensional sparse domains including text and recommender systems . Finally , we address overlapping clustering by examining the properties of data
in similarity spaces . We develop a probabilistic generative model of overlapping data in similarity spaces , and then develop two conceptual approaches to
discovering overlapping clustering in similarity spaces . The first of these is the conceptual multiple -membership generalization of hierarchical agglomerative
clustering , and the second is an iterative density hill -climbing algorithm . |