A fully Bayesian, nonparametric method for analyzing heterogeneous, high-dimensional data.
CrossCat is a domain-general, Bayesian method for analyzing high-dimensional data tables. CrossCat estimates the full joint distribution over the variables in the table from the data, via approximate inference in a hierarchical, nonparametric Bayesian model, and provides efficient samplers for every conditional distribution. CrossCat combines strengths of nonparametric mixture modeling and Bayesian network structure learning: it can model any joint distribution given enough data by positing latent variables, but also discovers independencies between the observable variables.
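The latent structure described above can be illustrated with a small sketch. It assumes the cross-categorization structure from the Shafto et al. (2011) paper: columns are partitioned into views, and each view has its own partition of the rows, with both partitions drawn from a Chinese restaurant process. The function names, the concentration parameters alpha_cols and alpha_rows, and the table size are hypothetical choices for illustration. This is not the CrossCat software's API, and it samples only from the prior over structures rather than performing inference on data.

```python
import random


def crp_partition(n_items, alpha, rng):
    """Assign n_items to clusters with a Chinese restaurant process."""
    assignments = []
    counts = []  # counts[k] = number of items already in cluster k
    for _ in range(n_items):
        # Join existing cluster k with weight counts[k],
        # or open a new cluster with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments


def sample_cross_categorization(n_rows, n_cols, alpha_cols=1.0,
                                alpha_rows=1.0, seed=0):
    """Sample a column-to-view partition and one row partition per view."""
    rng = random.Random(seed)
    # Columns in different views are modeled as independent (assumption
    # taken from the published model, not from this page).
    col_to_view = crp_partition(n_cols, alpha_cols, rng)
    n_views = max(col_to_view) + 1
    # Each view clusters the same rows in its own way.
    row_clusters = [crp_partition(n_rows, alpha_rows, rng)
                    for _ in range(n_views)]
    return col_to_view, row_clusters


col_to_view, row_clusters = sample_cross_categorization(n_rows=6, n_cols=4)
print("view of each column:", col_to_view)
for view, clusters in enumerate(row_clusters):
    print("row clusters in view", view, ":", clusters)
```

Because each view carries its own row partition, the same rows can be grouped differently depending on which columns are considered, which is what allows multiple overlapping clusterings of one table.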
A range of exploratory analysis and predictive modeling tasks can be addressed via CrossCat, including detecting predictive relationships between variables, finding multiple overlapping clusterings, imputing missing values, and simultaneously selecting features and classifying rows. Research on CrossCat has shown that it is suitable for analysis of real-world tables of up to 10 million cells, including hospital cost and quality measures, voting records, handwritten digits, and state-level unemployment time series.
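As a rough illustration of detecting predictive relationships between variables, the sketch below estimates how often two columns fall in the same view across a set of sampled structures. It assumes that columns in different views are modeled as independent, so the fraction of samples in which two columns share a view can serve as a summary of how likely they are to be dependent. The function name dependence_probability, the column names, and the four hand-written samples are hypothetical and are not output from the CrossCat software.

```python
def dependence_probability(samples, col_a, col_b):
    """Fraction of sampled structures in which two columns share a view."""
    same_view = sum(1 for s in samples if s[col_a] == s[col_b])
    return same_view / len(samples)


# Hypothetical samples: each maps a column name to a view index.
samples = [
    {"cost": 0, "quality": 0, "region": 1},
    {"cost": 0, "quality": 0, "region": 0},
    {"cost": 0, "quality": 1, "region": 1},
    {"cost": 1, "quality": 1, "region": 0},
]

print(dependence_probability(samples, "cost", "quality"))  # 0.75
print(dependence_probability(samples, "cost", "region"))   # 0.25
```

Averaging over many sampled structures, rather than relying on a single one, reflects the uncertainty about which columns belong together.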
Shafto, P., Kemp, C., Mansinghka, V.K., & Tenenbaum, J.B. (2011). A probabilistic model of cross-categorization. Cognition, 120, 1-25. [PDF]
Mansinghka, V., Shafto, P., Jonas, E., Petschulat, C., Gasner, M., & Tenenbaum, J. (2015). CrossCat: A fully Bayesian, nonparametric method for analyzing heterogeneous, high-dimensional data. Forthcoming, Journal of Machine Learning Research. Preprint available on arXiv, number 1512.01272. [PDF]
Li, D. & Shafto, P. (2011). Bayesian hierarchical cross-clustering. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR W&CP, 15, 443-451.
CrossCat and its sister project, BayesDB, are being developed by the ProbComp Team at the Massachusetts Institute of Technology and by Pat Shafto and Baxter Eaves at the University of Louisville.
If you have any comments or questions, please feel free to email us at crosscat [AT] mit.edu.
This research and development is supported in part by the DARPA XDATA program.