NBCTK: Naive Bayes Clustering Toolkit
NBCTK (Naive Bayes Clustering Toolkit) is a C implementation of several
probabilistic inference algorithms related to naive Bayes clustering —
probabilistic clustering based on a naive Bayes model.
News
Version 0.7 was released on May 2, 2011.
Features
NBCTK receives a dataset in tabular format with discrete (nominal) and
continuous (numeric) values, and performs the following tasks:
- Clustering via EM learning
- Avoiding undesirable local maxima by:
  - Random restarts
  - Deterministic annealing EM (DAEM) algorithm
  - Split-merge EM (SMEM) algorithm
- Determining the number of clusters by a Bayesian score
- Others:
  - Relevance analysis between the latent classes and objects (or attribute values)
  - Evaluation of clusters (using purity, Rand index, etc.)
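As an illustration of the cluster evaluation measures named in the list above, here is a minimal C sketch of purity and the Rand index. It is not taken from the NBCTK sources: the function names `purity` and `rand_index`, the integer encoding of clusters and reference classes, and the brute-force loops are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Purity: for each cluster, take the size of its most frequent reference
 * class, sum these sizes, and divide by the number of cases.
 * Assumes cluster[i] is in [0, n_clusters) and label[i] in [0, n_classes). */
double purity(const int *cluster, const int *label, size_t n,
              int n_clusters, int n_classes)
{
    assert(n > 0 && n_clusters > 0 && n_classes > 0);
    size_t matched = 0;
    for (int c = 0; c < n_clusters; c++) {
        size_t best = 0;
        for (int k = 0; k < n_classes; k++) {
            size_t count = 0;
            for (size_t i = 0; i < n; i++)
                if (cluster[i] == c && label[i] == k)
                    count++;
            if (count > best)
                best = count;
        }
        matched += best;
    }
    return (double)matched / (double)n;
}

/* Rand index: fraction of case pairs on which the clustering and the
 * reference labelling agree (both put the pair together, or both apart). */
double rand_index(const int *cluster, const int *label, size_t n)
{
    assert(n >= 2);
    size_t agree = 0, pairs = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            int same_cluster = (cluster[i] == cluster[j]);
            int same_label = (label[i] == label[j]);
            if (same_cluster == same_label)
                agree++;
            pairs++;
        }
    }
    return (double)agree / (double)pairs;
}
```

For example, with cluster assignments {0, 0, 0, 1, 1, 1} and reference classes {0, 0, 1, 1, 1, 1}, purity is 5/6 and the Rand index is 10/15. The pairwise loop is quadratic in the number of cases, which keeps the sketch short but would be slow on large datasets.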
The underlying learning framework can be switched among ML (maximum likelihood), MAP
(maximum a posteriori) and VB (variational Bayes), depending on the data of interest.
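To make the difference between ML and MAP concrete, the following C sketch estimates a categorical distribution from (expected) counts. It is only an illustration under stated assumptions: the function name `estimate_categorical` and the symmetric Dirichlet prior with a single hyperparameter `alpha` are hypothetical and do not describe how NBCTK parameterizes its priors. VB is not covered here, since it maintains a distribution over the parameters rather than a point estimate.

```c
#include <assert.h>
#include <stddef.h>

/* Point estimate of a categorical distribution from counts.
 * alpha == 1.0 yields the ML estimate (relative frequencies).
 * alpha >  1.0 yields the MAP estimate under a symmetric Dirichlet(alpha)
 * prior, which amounts to adding (alpha - 1) pseudo-counts per value.
 * The result is written to theta[0 .. n_values - 1]. */
void estimate_categorical(const double *counts, size_t n_values,
                          double alpha, double *theta)
{
    assert(n_values > 0 && alpha >= 1.0);
    double total = 0.0;
    for (size_t v = 0; v < n_values; v++)
        total += counts[v] + (alpha - 1.0);
    assert(total > 0.0);   /* at least one count or pseudo-count is needed */
    for (size_t v = 0; v < n_values; v++)
        theta[v] = (counts[v] + (alpha - 1.0)) / total;
}
```

With counts {3, 0, 1}, the ML estimate (alpha = 1) assigns probability 0 to the unseen second value, whereas the MAP estimate with alpha = 2 assigns it 1/7. This smoothing effect is one reason to prefer MAP over ML on sparse data.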
NBCTK also has additional features:
- Missing values allowed
- Flexible configurations allowed
- Auxiliary tool for post-processing
- Rich reporting
- OpenMP support (not heavily optimized)
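The following C sketch shows one common way a naive Bayes clustering model can tolerate missing values when computing cluster membership probabilities (the E-step of EM) for discrete attributes. It is an assumption-laden illustration, not NBCTK code: the `MISSING` marker, the function name `cluster_posterior` and the flat array layout are hypothetical. It also multiplies plain probabilities for brevity; an implementation meant for many attributes would work in log scale to avoid underflow.

```c
#include <assert.h>
#include <stddef.h>

#define MISSING (-1)   /* hypothetical marker for a missing value */

/* Posterior P(cluster | case) under a naive Bayes model with discrete
 * attributes. Missing attributes are skipped, which marginalizes them out.
 *   prior[c]                              : P(cluster = c)
 *   cond[(c * n_attrs + j) * n_vals + v]  : P(attr j = v | cluster = c)
 *   x[j]                                  : value of attr j, or MISSING
 * The result is written to post[0 .. n_clusters - 1]. */
void cluster_posterior(const double *prior, const double *cond,
                       const int *x, size_t n_clusters, size_t n_attrs,
                       size_t n_vals, double *post)
{
    assert(n_clusters > 0 && n_vals > 0);
    double norm = 0.0;
    for (size_t c = 0; c < n_clusters; c++) {
        double joint = prior[c];
        for (size_t j = 0; j < n_attrs; j++) {
            if (x[j] == MISSING)
                continue;   /* the factor sums to 1 over v, so it drops out */
            assert(x[j] >= 0 && (size_t)x[j] < n_vals);
            joint *= cond[(c * n_attrs + j) * n_vals + (size_t)x[j]];
        }
        post[c] = joint;
        norm += joint;
    }
    assert(norm > 0.0);
    for (size_t c = 0; c < n_clusters; c++)
        post[c] /= norm;    /* normalize so the posteriors sum to 1 */
}
```

Skipping a missing attribute is valid because of the naive Bayes independence assumption: given the cluster, each attribute contributes an independent factor, and summing that factor over all of its possible values gives 1.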
Download
Old versions are available from here.
Installation
Please follow the instructions described in the manual.
License
NBCTK is distributed under the modified BSD license.
Update history
- May 2, 2011: Released version 0.7
  - changed the tool name (NBCT -> NBCTK)
  - allowed users to specify attribute names
  - introduced a mechanism for automatic expansion of the matrix/vector pools
  - removed the functionality for computing dissimilarities between cases
- April 22, 2011: Released version 0.6
  - revised the way of specifying Dirichlet hyperparameters (vnbc)
  - revised the usage of the --smooth-var option (nbc, vnbc)
  - added several options: --version, --predict (-P), --regular-stat, --extreme (nbc, vnbc)
  - made -U no longer a short option of --uncompress (nbc, vnbc)
  - added the --hide-file (-F) option (nbcsep)
  - made cumulative bug fixes and a lot of code refinements
- June 15, 2010: Released version 0.5.1
  - fixed a problem in compilation with GCC 4.4
- Aug. 12, 2009: Updated the to-do list
- Apr. 18, 2009: Released version 0.5
  - introduced routines for cluster evaluation
  - renamed the option --log-valued to --log-scale (its short option, -J, remained unchanged)
  - made cumulative bug fixes and a lot of code refinements
- Mar. 9, 2009: Released version 0.4.3
  - fixed a problem where the source code could not be compiled on Mac OS X
    due to a name conflict for getopt()
- Mar. 1, 2009: Released version 0.4.2
  - corrected a wrong computation in predicting the clusters when using
    univariate Gaussian distributions in variational Bayes (the variance
    in Student's t-distribution was estimated as half of the correct value)
- Feb. 26, 2009: Released version 0.4.1
  - corrected a wrong computation of the Cheeseman-Stutz score
    when using univariate Gaussian distributions
- Jan. 24 and Feb. 25, 2009: Updated the to-do list
- Dec. 17, 2008: Released version 0.4
  - extended to handle continuous values
  - introduced a mechanism for log-valued probability computations to avoid underflow
  - almost entirely reorganized the optional flags (e.g. introduced long options)
  - made a lot of minor improvements
- Dec. 17, 2008: Renewed this site and updated the to-do list
- Nov. 7, 2007: Updated the to-do list
- Sept. 4, 2007: Released version 0.3
  - introduced the deterministic annealing EM algorithm
  - allowed missing values in the data file
  - corrected wrong computations in the partial EM step of VB-SMEM
- Aug. 20, 2007: Released version 0.2
  - extended nbcsep so that it can handle multiple files
  - fixed several bugs in printer routines
  - fixed several bugs with OpenMP
- Aug. 18, 2007: Released version 0.1
- Aug. 14, 2007: Released version 0.1 beta 6
- Aug. 13, 2007: Released version 0.1 beta 5
- Aug. 11, 2007: Released version 0.1 beta 4
- Aug. 10, 2007: Released version 0.1 beta 3
- Aug. 9, 2007: Released version 0.1 beta 2
- June 12, 2007: Released version 0.1 beta 1 (internal only)
- June 7, 2007: Created this site
To-do
- Outlier detection
- Introduction of a configuration file
- SMEM algorithm which tries to simultaneously solve the problems of avoiding undesirable local maxima and finding the optimal number of clusters
- Integration with some specialized linear algebra programs
Related software
- AutoClass:
  NBCTK was strongly influenced by AutoClass, a well-known probabilistic clustering tool.
  On the other hand, NBCTK has a few features not found in AutoClass
  (e.g. VB learning, DAEM and SMEM).
- PRISM:
  Most of the statistical techniques implemented in NBCTK are shared with PRISM,
  a probabilistic logic programming system.
Contact information
This software is developed by Yoshitaka Kameya.
NBCTK is still under development, so any feedback is highly welcome.
Please feel free to send e-mail to ykameya[at]meijo-u.ac.jp
(please replace [at] with @).
Last update: Apr. 5, 2015