NBCTK: Naive Bayes Clustering Toolkit

NBCTK (Naive Bayes Clustering Toolkit) is a C implementation of several probabilistic inference algorithms for naive Bayes clustering, that is, probabilistic clustering based on a naive Bayes model.

News

Version 0.7 was released on May 2, 2011.

Features

NBCTK receives a dataset in tabular format with discrete (nominal) and continuous (numeric) values, and performs the following tasks:

Clustering via EM learning
- Avoiding undesirable local maxima by:
  - Random restarts
  - Deterministic annealing EM (DAEM) algorithm
  - Split-merge EM (SMEM) algorithm
- Determining the number of clusters by a Bayesian score
Others:
- Relevance analysis between the latent classes and objects (or attribute values)
- Evaluation of clusters (using purity, Rand index, etc.)
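The two evaluation measures named above can be sketched in a few lines of C. This is only an illustration of the standard definitions of purity and the Rand index, not NBCTK's own code; the function names `purity` and `rand_index` and the convention that cluster and class IDs are zero-based integers are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Purity: for each cluster, count the cases belonging to its most frequent
 * true class, sum these counts over all clusters, and divide by n.
 * Assumes cluster[i] is in [0, k) and label[i] is in [0, c).
 * Returns -1.0 if the contingency table cannot be allocated. */
double purity(const int *cluster, const int *label, size_t n, int k, int c)
{
    assert(n > 0 && k > 0 && c > 0);
    /* counts[j * c + l] = number of cases in cluster j with true class l */
    size_t *counts = calloc((size_t)k * (size_t)c, sizeof *counts);
    if (counts == NULL)
        return -1.0;
    for (size_t i = 0; i < n; i++) {
        assert(cluster[i] >= 0 && cluster[i] < k);
        assert(label[i] >= 0 && label[i] < c);
        counts[(size_t)cluster[i] * (size_t)c + (size_t)label[i]]++;
    }
    size_t hits = 0;
    for (int j = 0; j < k; j++) {
        size_t best = 0;
        for (int l = 0; l < c; l++) {
            size_t m = counts[(size_t)j * (size_t)c + (size_t)l];
            if (m > best)
                best = m;
        }
        hits += best;
    }
    free(counts);
    return (double)hits / (double)n;
}

/* Rand index: fraction of case pairs on which the clustering and the true
 * classes agree (both put the pair together, or both keep it apart).
 * This direct version is O(n^2), which is fine for a small illustration. */
double rand_index(const int *cluster, const int *label, size_t n)
{
    assert(n >= 2);
    size_t agree = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            int same_cluster = (cluster[i] == cluster[j]);
            int same_label = (label[i] == label[j]);
            if (same_cluster == same_label)
                agree++;
        }
    }
    size_t pairs = n * (n - 1) / 2;
    return (double)agree / (double)pairs;
}
```

For six cases with clusters {0,0,0,1,1,1} and true classes {0,0,1,1,1,0}, purity is 4/6 and the Rand index is 7/15. Both measures are invariant to how the cluster IDs are numbered.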

The underlying learning framework can be switched among ML (maximum likelihood), MAP (maximum a posteriori) and VB (variational Bayes), depending on the data of interest. NBCTK also has additional features:
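To make the difference between ML and MAP concrete, here is a small C sketch of how the two frameworks would turn the (expected) counts of a discrete attribute's values within one cluster into probability estimates. It is a textbook illustration, not NBCTK's implementation: the function names, the symmetric Dirichlet prior and the hyperparameter `alpha` are assumptions for this example, and NBCTK's own parameterization of the prior may differ. VB is omitted because it needs the digamma function, which the C standard library does not provide.

```c
#include <assert.h>
#include <stddef.h>

/* ML estimate: theta[v] = count[v] / sum of counts.
 * A value that was never observed gets probability exactly 0. */
void estimate_ml(const double *count, double *theta, size_t nvals)
{
    assert(nvals > 0);
    double total = 0.0;
    for (size_t v = 0; v < nvals; v++)
        total += count[v];
    assert(total > 0.0);
    for (size_t v = 0; v < nvals; v++)
        theta[v] = count[v] / total;
}

/* MAP estimate under a symmetric Dirichlet(alpha) prior (assumed here):
 * the posterior mode is (count[v] + alpha - 1) / sum_v (count[v] + alpha - 1).
 * Requires alpha >= 1 so that the mode is well defined; alpha == 1 gives
 * the same result as the ML estimate. */
void estimate_map(const double *count, double *theta, size_t nvals,
                  double alpha)
{
    assert(nvals > 0 && alpha >= 1.0);
    double total = 0.0;
    for (size_t v = 0; v < nvals; v++)
        total += count[v] + alpha - 1.0;
    assert(total > 0.0);
    for (size_t v = 0; v < nvals; v++)
        theta[v] = (count[v] + alpha - 1.0) / total;
}
```

With counts {3, 0, 1}, the ML estimate is {0.75, 0, 0.25}, while MAP with alpha = 2 gives {4/7, 1/7, 2/7}. The prior keeps unseen values from receiving zero probability, which matters most for small or sparse datasets.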

Missing values allowed
Flexible configurations allowed
Auxiliary tool for post-processing
Rich reporting
OpenMP support (not heavily optimized)

Download

Old versions are available from here.

Installation

Please follow the instructions described in the manual.

License

NBCTK is distributed under the modified BSD license.

Update history

May 2, 2011: Released version 0.7
- changed the tool name (NBCT -> NBCTK)
- allowed users to specify attribute names
- introduced a mechanism for automatic expansion of the matrix/vector pools
- abolished the functionality for computing dissimilarities between cases
April 22, 2011: Released version 0.6
- renewed the way of specifying Dirichlet hyperparameters (vnbc)
- renewed the usage of the --smooth-var option (nbc, vnbc)
- added several options: --version, --predict (-P), --regular-stat, --extreme (nbc, vnbc)
- removed -U as a short option for --uncompress (nbc, vnbc)
- added the --hide-file (-F) option (nbcsep)
- made cumulative bug fixes and a lot of code refinements
June 15, 2010: Released version 0.5.1
- fixed a problem in compilation with GCC 4.4
Aug. 12, 2009: Updated the to-do list
Apr. 18, 2009: Released version 0.5
- introduced routines for cluster evaluation
- renamed the option --log-valued to --log-scale (though its short option remained unchanged as -J)
- made cumulative bug fixes and a lot of code refinements
Mar. 9, 2009: Released version 0.4.3
- fixed a problem where the source code could not be compiled on Mac OS X due to a name conflict with getopt()
Mar. 1, 2009: Released version 0.4.2
- corrected a wrong computation in predicting the clusters when using univariate Gaussian distributions in variational Bayes (the variance of the Student's t-distribution was estimated as half the correct value)
Feb. 26, 2009: Released version 0.4.1
- corrected a wrong computation in computing the Cheeseman-Stutz score when using univariate Gaussian distributions
Jan. 24 and Feb. 25, 2009: Updated the to-do list
Dec. 17, 2008: Released version 0.4
- extended to handle continuous values
- introduced a mechanism for log-valued probability computations for avoiding underflow
- almost entirely reorganized the optional flags (e.g. introduced long options)
- made a lot of minor improvements
Dec. 17, 2008: Renewed this site and updated the to-do list
Nov. 7, 2007: Updated the to-do list
Sept. 4, 2007: Released version 0.3
- introduced the deterministic annealing EM algorithm
- allowed missing values in the data file
- corrected wrong computations in the partial EM step of VB-SMEM
Aug. 20, 2007: Released version 0.2
- extended nbcsep so that it can handle multiple files
- fixed several bugs in printer routines
- fixed several bugs with OpenMP
Aug. 18, 2007: Released version 0.1
Aug. 14, 2007: Released version 0.1 beta 6
Aug. 13, 2007: Released version 0.1 beta 5
Aug. 11, 2007: Released version 0.1 beta 4
Aug. 10, 2007: Released version 0.1 beta 3
Aug. 9, 2007: Released version 0.1 beta 2
June 12, 2007: Released version 0.1 beta 1 (internal only)
June 7, 2007: Created this site

To-do

Outlier detection
Introduction of a configuration file
SMEM algorithm which tries to simultaneously solve the problems of avoiding undesirable local maxima and finding the optimal number of clusters
Integration with some specialized linear algebra programs

Related software

AutoClass: NBCTK was strongly influenced by AutoClass, a well-known probabilistic clustering tool. NBCTK does, however, offer a few features not found in AutoClass (e.g. VB learning, DAEM and SMEM).
PRISM: Most of the statistical techniques implemented in NBCTK are shared with PRISM, a probabilistic logic programming system.

Contact information

This software is developed by Yoshitaka Kameya. NBCTK is still under development, so any feedback is highly welcome. Please feel free to send e-mail to ykameya[at]meijo-u.ac.jp (replace [at] with @).

Last update: Apr. 5, 2015