BayesDB is a probabilistic programming platform that provides built-in non-parametric Bayesian model discovery. BayesDB makes it easy for users without statistics training to search, clean, and model multivariate databases using an SQL-like language.

BayesDB builds on probabilistic programming, an emerging field founded on the insight that probabilistic models and inference algorithms are a new kind of software, and therefore amenable to radical improvements in accessibility, productivity, and scale. Unfortunately, most probabilistic programming systems require users to write probabilistic programs by hand. BayesDB instead provides a built-in probabilistic program synthesis system that builds generative models for multivariate databases via inference over programs given a non-parametric Bayesian prior. BayesDB also enables statisticians to override these programs with custom statistical models when appropriate.
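
To give a sense of what a non-parametric Bayesian prior looks like, the sketch below samples from a Chinese restaurant process, a standard prior over partitions in which the number of groups is not fixed in advance. It is an illustration of the general idea only: the text above does not say which prior BayesDB's synthesis system uses, and the function name `sample_crp_partition` and its parameters are assumptions made for this example.

```python
import random


def sample_crp_partition(n_items, alpha, rng):
    """Assign n_items to clusters by a Chinese restaurant process.

    Illustrative only; not taken from BayesDB. `alpha` controls how
    readily new clusters are created.
    """
    assignments = []
    counts = []  # counts[k] = number of items already in cluster k
    for _ in range(n_items):
        # An existing cluster k has weight counts[k];
        # opening a new cluster has weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1    # join an existing cluster
        assignments.append(k)
    return assignments


rng = random.Random(0)
partition = sample_crp_partition(20, alpha=1.0, rng=rng)
n_clusters = len(set(partition))
```

The number of clusters is an outcome of sampling rather than an input, which is the sense in which such a prior lets model structure be inferred from data.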

AI-assisted collaborative science with BayesDB

We are currently engaged in applied research in AI-assisted collaborative science using BayesDB, supported in part by DARPA’s Synergistic Discovery and Design (SD2) program. Under this program, BayesDB is being used to help groups of scientists search and screen datasets generated by multiple experimental labs, and also to generate recommendations for what new data would be most valuable to gather. These capabilities rest on inference in hierarchical generative models, some provided by domain scientists and some built by BayesDB from experimental data.

Example AI-assisted data science capabilities from existing applications of BayesDB:

  1. Probabilistic search: query by example (“find me the 10 colleges most like MIT and Harvard with regard to graduates’ median income, but with admissions rates over 10%”), and query by probability (“find me colleges with the most unexpectedly high mean income for graduates”).

  2. AI assessment of data quality: “find me radiation measurements from the project that are most likely to be errors and/or legitimate anomalies.”

  3. Virtual experiments: “generate some EEG measurements we might expect for a child in Bangladesh at 3 years of age, given all the other EEG measurements we have observed so far.”

  4. AI-assisted inferential statistics: “what genetic markers, if any, predict increased risk of suicide given a PTSD diagnosis? And how confident can we be in the amount of increase, given uncertainty due to statistical sampling and the large number of possible alternative explanations?”
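
The data-quality and query-by-probability examples above share one idea: score each record by how probable it is under a model of the rest of the data, then look at the least probable records first. The sketch below shows that idea with a deliberately simple stand-in, a single Gaussian fit to the other values in one column. It is not BayesDB's modeling approach or API; the function names and the readings are hypothetical.

```python
import math
from statistics import mean, stdev


def gaussian_logpdf(x, mu, sigma):
    """Log density of x under a Gaussian with mean mu and std dev sigma."""
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (x - mu) ** 2 / (2 * sigma ** 2))


def rank_by_surprise(values):
    """Return (index, log density) pairs, least probable first.

    Each value is scored under a Gaussian fit to all *other* values
    (leave-one-out), so an outlier does not distort its own score.
    """
    scores = []
    for i, x in enumerate(values):
        rest = values[:i] + values[i + 1:]
        mu, sigma = mean(rest), stdev(rest)
        scores.append((i, gaussian_logpdf(x, mu, sigma)))
    return sorted(scores, key=lambda pair: pair[1])


# Hypothetical radiation readings; the value 9.7 is a planted anomaly.
readings = [1.1, 0.9, 1.0, 1.2, 9.7, 1.05, 0.95]
ranking = rank_by_surprise(readings)
most_surprising_index = ranking[0][0]
```

A richer multivariate model would replace the single Gaussian, but the ranking step would stay the same: low predictive probability marks a record as a candidate error or a legitimate anomaly, and a person decides which.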


The source code for bayeslite, a Python/sqlite3 implementation of BayesDB, is available on GitHub. We recommend installing BayesDB using the MIT Inference Stack.


Probabilistic search for structured data via probabilistic programming and nonparametric Bayes. Saad, F.; Casarsa, L.; and Mansinghka, V. arXiv preprint, arXiv:1704.01087. 2017.

Detecting dependencies in sparse, multivariate databases using probabilistic programming and non-parametric Bayes. Saad, F.; and Mansinghka, V. In Artificial Intelligence and Statistics (AISTATS). 2017.

A Probabilistic Programming Approach To Probabilistic Data Analysis. Saad, F.; and Mansinghka, V. In Advances in Neural Information Processing Systems (NIPS). 2016.

Probabilistic data analysis with probabilistic programming. Saad, F.; and Mansinghka, V. arXiv preprint, arXiv:1608.05347. 2016.

BayesDB: A probabilistic programming system for querying the probable implications of data. Mansinghka, V.; Tibbetts, R.; Baxter, J.; Shafto, P.; and Eaves, B. arXiv preprint, arXiv:1512.05006. 2015.