ClustBoost: a C++ Template Library for Clustering



Overview

ClustBoost is an open source template library for text clustering. It currently implements two clustering algorithms, K-Means and Shingles, and can easily be extended to include others.

Features

  • Implements multiple clustering algorithms (currently K-Means and Shingles)
  • Open source, released under the GPL
  • C++ code using templates and the STL
  • Several support classes for reading text, shared by the clustering algorithms (Reader, Dictionary, etc.)

K-Means Features

  • Classical K-Means algorithm optimized for text clustering over a stream of data
  • The vector space model is implemented using sparse vectors (sorted lists of pairs)
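
As an illustration of the sparse-vector representation mentioned above, here is a minimal sketch of a vector stored as a sorted list of pairs. The names TermId, SparseVector, dot and cosine are hypothetical and are not taken from the ClustBoost sources; the (term id, weight) layout and the use of cosine similarity are assumptions (cosine is a common choice for text, but the library may use a different measure).

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; not ClustBoost's actual API.
using TermId = std::uint32_t;
// (term id, weight) pairs, kept sorted by term id with no duplicate ids.
using SparseVector = std::vector<std::pair<TermId, double>>;

// Dot product by merging the two sorted lists: O(|a| + |b|) time.
double dot(const SparseVector& a, const SparseVector& b) {
    double sum = 0.0;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].first < b[j].first) {
            ++i;                                  // term only in a
        } else if (b[j].first < a[i].first) {
            ++j;                                  // term only in b
        } else {
            sum += a[i].second * b[j].second;     // term in both
            ++i;
            ++j;
        }
    }
    return sum;
}

// Cosine similarity; returns 0 if either vector has zero norm.
double cosine(const SparseVector& a, const SparseVector& b) {
    const double na = std::sqrt(dot(a, a));
    const double nb = std::sqrt(dot(b, b));
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot(a, b) / (na * nb);
}
```

Because both lists are sorted, comparing a document with a centroid only touches the non-zero entries, which is what makes this layout attractive for high-dimensional text data.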

Shingles Features

  • Broder's Shingles algorithm optimized for text clustering over a stream of data
  • The length of a shingle is unbounded (from a single word to the whole document)
  • Linear min-wise permutations with 64-bit fingerprints and block-oriented computation
  • Clusters are maintained through a union-find data structure
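
The features above can be sketched roughly as follows. This is an illustrative outline, not the ClustBoost implementation: all names (fingerprint, shingles, sketch, resemblance, UnionFind) are hypothetical, FNV-1a stands in for whatever fingerprint function the library uses, and the linear map is taken modulo 2^64 for simplicity (linear permutations are often defined modulo a prime instead). The block-oriented computation is not modeled here.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <numeric>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// All names below are hypothetical; this is not ClustBoost's actual API.

// 64-bit FNV-1a hash, used here as a stand-in fingerprint function.
std::uint64_t fingerprint(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : s) { h ^= c; h *= 1099511628211ULL; }
    return h;
}

// Fingerprints of every shingle of w consecutive words in the text.
std::vector<std::uint64_t> shingles(const std::string& text, std::size_t w) {
    std::istringstream in(text);
    std::vector<std::string> words;
    for (std::string t; in >> t;) words.push_back(t);
    std::vector<std::uint64_t> out;
    for (std::size_t i = 0; w > 0 && i + w <= words.size(); ++i) {
        std::string s;
        for (std::size_t k = 0; k < w; ++k) { s += words[i + k]; s += ' '; }
        out.push_back(fingerprint(s));
    }
    return out;
}

using Perm = std::pair<std::uint64_t, std::uint64_t>;  // (a, b) of x -> a*x + b

// For each linear map (mod 2^64), keep the minimum image over all fingerprints.
// Forcing the multiplier to be odd makes the map a bijection on 64-bit values.
std::vector<std::uint64_t> sketch(const std::vector<std::uint64_t>& fps,
                                  const std::vector<Perm>& perms) {
    std::vector<std::uint64_t> mins(perms.size(),
                                    std::numeric_limits<std::uint64_t>::max());
    for (std::uint64_t x : fps)
        for (std::size_t p = 0; p < perms.size(); ++p) {
            const std::uint64_t y = (perms[p].first | 1ULL) * x + perms[p].second;
            if (y < mins[p]) mins[p] = y;
        }
    return mins;
}

// Fraction of sketch positions that agree: a rough resemblance estimate.
double resemblance(const std::vector<std::uint64_t>& a,
                   const std::vector<std::uint64_t>& b) {
    if (a.empty() || a.size() != b.size()) return 0.0;
    std::size_t same = 0;
    for (std::size_t i = 0; i < a.size(); ++i) same += (a[i] == b[i]);
    return static_cast<double>(same) / a.size();
}

// Union-find with path halving and union by size; one element per document.
class UnionFind {
public:
    explicit UnionFind(std::size_t n) : parent_(n), size_(n, 1) {
        std::iota(parent_.begin(), parent_.end(), std::size_t{0});
    }
    std::size_t find(std::size_t x) {
        while (parent_[x] != x) { parent_[x] = parent_[parent_[x]]; x = parent_[x]; }
        return x;
    }
    void unite(std::size_t a, std::size_t b) {
        a = find(a);
        b = find(b);
        if (a == b) return;
        if (size_[a] < size_[b]) std::swap(a, b);
        parent_[b] = a;
        size_[a] += size_[b];
    }
private:
    std::vector<std::size_t> parent_, size_;
};
```

In a streaming setting, each incoming document would be sketched once, compared against the stored sketches, and merged with unite() into the cluster of any document whose estimated resemblance exceeds a chosen threshold; find() then returns the cluster representative. The threshold and the number of permutations are tuning parameters that this sketch leaves open.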

Download

You can download ClustBoost