Overview
ClustBoost is an open source template library for text
clustering. It currently implements two clustering algorithms, K-Means
and Shingles,
but it can easily be extended to include other algorithms.
Features
- Implements several clustering algorithms (currently K-Means and Shingles)
- Open source, released under the GPL
- C++ code using templates and the STL
- Several support classes for reading text, useful for many
clustering algorithms (Reader, Dictionary, etc.)
K-Means Features
- Classical K-Means algorithm, optimized for text clustering over a stream of data
- The vector space model is implemented using sparse vectors (sorted lists
of pairs)
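
To illustrate why a sorted list of pairs suits the vector space model, here is a minimal sketch of a sparse dot product and cosine similarity computed by merging two sorted lists. The names SparseVec, dot and cosine are hypothetical and chosen for illustration only; they are not taken from the ClustBoost sources.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical type: (term id, weight) pairs kept sorted by term id.
using SparseVec = std::vector<std::pair<unsigned, double>>;

// Dot product by a single merge pass over both sorted lists,
// so the cost is O(|a| + |b|) rather than the dictionary size.
double dot(const SparseVec& a, const SparseVec& b) {
    double sum = 0.0;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].first < b[j].first) {
            ++i;                       // term only in a
        } else if (b[j].first < a[i].first) {
            ++j;                       // term only in b
        } else {
            sum += a[i].second * b[j].second;  // shared term
            ++i;
            ++j;
        }
    }
    return sum;
}

// Cosine similarity; returns 0 when either vector has zero norm.
double cosine(const SparseVec& a, const SparseVec& b) {
    const double na = std::sqrt(dot(a, a));
    const double nb = std::sqrt(dot(b, b));
    return (na == 0.0 || nb == 0.0) ? 0.0 : dot(a, b) / (na * nb);
}
```

In a K-Means loop, a document would typically be assigned to the centroid with the highest cosine value; keeping the pairs sorted is what makes each comparison a linear merge.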
Shingles Features
- Broder's Shingles algorithm, optimized for text clustering over a stream of
data
- The length of a shingle is unbounded (from a single word
to the whole document)
- Linear min-wise permutation with 64-bit fingerprints and block-oriented
computation
- Clusters are maintained through a union-find data structure
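
The features above can be sketched roughly as follows: fingerprint each w-word shingle to 64 bits, keep the minimum image under a few linear maps as the document sketch, and merge similar documents with union-find. This is an illustration under stated assumptions, not ClustBoost's actual code: the function names, the FNV-1a fingerprint, and the maps x -> a*x + b (mod 2^64, with a odd) are all choices made here, and such linear maps only approximate true min-wise independence.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <numeric>
#include <string>
#include <vector>

// 64-bit fingerprint of every w-word shingle (w >= 1).
// FNV-1a is an assumption; any 64-bit fingerprint would do.
std::vector<std::uint64_t> shingles(const std::vector<std::string>& words,
                                    std::size_t w) {
    std::vector<std::uint64_t> out;
    for (std::size_t i = 0; i + w <= words.size(); ++i) {
        std::uint64_t h = 14695981039346656037ULL;       // FNV offset basis
        for (std::size_t k = i; k < i + w; ++k) {
            for (unsigned char c : words[k]) { h ^= c; h *= 1099511628211ULL; }
            h ^= ' ';                                     // word separator
            h *= 1099511628211ULL;
        }
        out.push_back(h);
    }
    return out;
}

// For each linear map i, keep the minimum of a[i]*x + b[i] (mod 2^64).
// An odd a[i] makes the map a bijection on 64-bit values.
std::vector<std::uint64_t> sketch(const std::vector<std::uint64_t>& sh,
                                  const std::vector<std::uint64_t>& a,
                                  const std::vector<std::uint64_t>& b) {
    std::vector<std::uint64_t> mins(
        a.size(), std::numeric_limits<std::uint64_t>::max());
    for (std::uint64_t x : sh)
        for (std::size_t i = 0; i < a.size(); ++i)
            mins[i] = std::min(mins[i], a[i] * x + b[i]);
    return mins;
}

// Estimated resemblance: fraction of sketch positions that agree.
double resemblance(const std::vector<std::uint64_t>& s,
                   const std::vector<std::uint64_t>& t) {
    if (s.empty() || s.size() != t.size()) return 0.0;
    std::size_t same = 0;
    for (std::size_t i = 0; i < s.size(); ++i) same += (s[i] == t[i]);
    return static_cast<double>(same) / static_cast<double>(s.size());
}

// Union-find over document indices; each set is one cluster.
struct UnionFind {
    std::vector<std::size_t> parent;
    explicit UnionFind(std::size_t n) : parent(n) {
        std::iota(parent.begin(), parent.end(), std::size_t{0});
    }
    std::size_t find(std::size_t x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];                // path halving
            x = parent[x];
        }
        return x;
    }
    void unite(std::size_t x, std::size_t y) { parent[find(x)] = find(y); }
};
```

A caller would compute one sketch per incoming document and call unite on any pair whose estimated resemblance exceeds a chosen threshold; documents sharing a root under find then form one cluster.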
Download
You can download ClustBoost