Overview
ClustBoost is an open source template library for text clustering.
It currently implements two clustering algorithms, KMeans and Shingles,
but it can easily be extended to include other algorithms.
Features
 Implements multiple clustering algorithms (currently KMeans and Shingles)
 Open source, released under the GPL
 C++ code built on templates and the STL
 Several support classes for reading text, shared by the clustering algorithms (Reader, Dictionary, etc.)
KMeans Features
 Classical KMeans algorithm optimized for text clustering over a stream of data
 The vector space model is implemented using sparse vectors (sorted lists of pairs)
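
To illustrate why a sorted list of pairs suits the vector space model, here is a minimal sketch of a sparse vector with a merge-based dot product and a nearest-centroid lookup. It is not ClustBoost's actual code: the names SparseVector, dot, cosine and nearest_centroid are hypothetical, and the choice of cosine similarity is an assumption, since this page does not say which similarity measure the library uses.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical type: (term id, weight) pairs, kept sorted by term id
// with no duplicate ids.
using SparseVector = std::vector<std::pair<std::uint32_t, double>>;

// Dot product computed with a single linear merge of the two sorted lists.
double dot(const SparseVector& a, const SparseVector& b) {
    double sum = 0.0;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].first < b[j].first) {
            ++i;
        } else if (b[j].first < a[i].first) {
            ++j;
        } else {
            sum += a[i].second * b[j].second;
            ++i;
            ++j;
        }
    }
    return sum;
}

// Cosine similarity (an assumed measure); 0 if either vector has zero norm.
double cosine(const SparseVector& a, const SparseVector& b) {
    const double na = std::sqrt(dot(a, a));
    const double nb = std::sqrt(dot(b, b));
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot(a, b) / (na * nb);
}

// KMeans assignment step: index of the centroid most similar to doc.
// Returns centroids.size() when there are no centroids.
std::size_t nearest_centroid(const SparseVector& doc,
                             const std::vector<SparseVector>& centroids) {
    std::size_t best = centroids.size();
    double best_sim = -1.0;
    for (std::size_t k = 0; k < centroids.size(); ++k) {
        const double s = cosine(doc, centroids[k]);
        if (s > best_sim) {
            best_sim = s;
            best = k;
        }
    }
    return best;
}
```

Because both lists are sorted by term id, the dot product costs time proportional to the number of non-zero entries rather than to the dictionary size, which matters when documents arrive as a stream.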
Shingles Features
 Broder's Shingles algorithm optimized for text clustering over a stream of data
 The length of a shingle is unbounded (from a single word to the whole document)
 Linear min-wise permutations with 64-bit fingerprints and block-oriented computation
 Clusters are maintained through a union-find data structure
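
The pieces listed above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the library's implementation: every name here (fingerprint, shingles, LinearPerm, sketch, resemblance, UnionFind) is hypothetical, FNV-1a stands in for whatever 64-bit fingerprint function ClustBoost actually uses, the linear permutation is taken to be x -> a*x + b modulo 2^64 with a odd, and the block-oriented computation is not shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <numeric>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// 64-bit FNV-1a hash, used here only as a stand-in fingerprint function.
std::uint64_t fingerprint(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : s) { h ^= c; h *= 1099511628211ULL; }
    return h;
}

// Fingerprints of every run of w consecutive words (a "shingle") in text.
// Returns an empty list if w is 0 or the text has fewer than w words.
std::vector<std::uint64_t> shingles(const std::string& text, std::size_t w) {
    std::istringstream in(text);
    std::vector<std::string> words;
    for (std::string t; in >> t;) words.push_back(t);
    std::vector<std::uint64_t> out;
    if (w == 0 || words.size() < w) return out;
    for (std::size_t i = 0; i + w <= words.size(); ++i) {
        std::string s = words[i];
        for (std::size_t j = 1; j < w; ++j) s += ' ' + words[i + j];
        out.push_back(fingerprint(s));
    }
    return out;
}

// Assumed form of a linear permutation of 64-bit values; a must be odd
// so that x -> a*x + b (mod 2^64) is a bijection.
struct LinearPerm { std::uint64_t a, b; };

// For each permutation, keep the minimum permuted fingerprint.
std::vector<std::uint64_t> sketch(const std::vector<std::uint64_t>& fps,
                                  const std::vector<LinearPerm>& perms) {
    std::vector<std::uint64_t> mins(
        perms.size(), std::numeric_limits<std::uint64_t>::max());
    for (std::uint64_t x : fps)
        for (std::size_t k = 0; k < perms.size(); ++k) {
            const std::uint64_t y = perms[k].a * x + perms[k].b;  // wraps mod 2^64
            if (y < mins[k]) mins[k] = y;
        }
    return mins;
}

// Estimated resemblance: fraction of sketch positions that agree.
double resemblance(const std::vector<std::uint64_t>& a,
                   const std::vector<std::uint64_t>& b) {
    if (a.empty() || a.size() != b.size()) return 0.0;
    std::size_t eq = 0;
    for (std::size_t k = 0; k < a.size(); ++k) eq += (a[k] == b[k]);
    return static_cast<double>(eq) / static_cast<double>(a.size());
}

// Union-find over document indices, with path halving and union by rank.
class UnionFind {
public:
    explicit UnionFind(std::size_t n) : parent_(n), rank_(n, 0) {
        std::iota(parent_.begin(), parent_.end(), std::size_t{0});
    }
    std::size_t find(std::size_t x) {
        while (parent_[x] != x) {
            parent_[x] = parent_[parent_[x]];
            x = parent_[x];
        }
        return x;
    }
    void unite(std::size_t a, std::size_t b) {
        a = find(a);
        b = find(b);
        if (a == b) return;
        if (rank_[a] < rank_[b]) std::swap(a, b);
        parent_[b] = a;
        if (rank_[a] == rank_[b]) ++rank_[a];
    }
private:
    std::vector<std::size_t> parent_;
    std::vector<unsigned char> rank_;
};
```

A streaming clusterer built on this sketch would compute the sketch of each incoming document, compare it with earlier sketches, and call unite on any pair whose estimated resemblance exceeds a chosen threshold; two documents then belong to the same cluster exactly when find returns the same representative for both. Note that a linear map of this kind is a true permutation but only approximates min-wise independence, so the resemblance value is an estimate.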
Download
You can download ClustBoost
