BayesDB is a platform for AI-assisted data science that enables domain experts to answer questions in seconds or minutes that otherwise require hours or days of work by someone with good statistical judgment.

Why is this important? Stakeholders in business, humanitarian work, science, and government increasingly recognize the importance of drawing statistical inferences from their data. Existing approaches require experts in statistical modeling or data scientists proficient in applied machine learning, and these skills are projected to be in short supply as demand for statistical inference grows across fields. In addition, developers new to machine learning may be stymied by the maze that is the current machine learning toolkit, which can come up short in settings that don’t match canonical machine learning problems.

How does BayesDB solve these problems? First, BayesDB provides a simple, SQL-like query language for asking data science queries. This language can be used for data search, inferential statistics, and predictive modeling applications. Second, BayesDB provides AI assistance for exploratory data analysis and baseline statistical modeling via CrossCat, a probabilistic method that emulates many of the judgment calls ordinarily made by a human data analyst. This AI assistance enables domain experts who lack training in statistics to draw rigorous inferences from messy real-world databases. Third, BayesDB provides an advanced “meta-modeling” language for customizing the AI’s modeling assumptions to incorporate both quantitative and qualitative domain knowledge. This enables expert statisticians to improve inference quality in risk-sensitive applications.

Examples of AI-assisted data science capabilities, drawn from existing applications of BayesDB:

  1. Probabilistic search: query by example (“find me the 10 colleges most like MIT and Harvard with regard to graduates’ median income, but with admissions rates over 10%”), and query by probability (“find me colleges with the most unexpectedly high mean income for graduates”).

  2. AI assessment of data quality: “find me radiation measurements from the project that are most likely to be errors and/or legitimate anomalies.”

  3. Virtual experiments: “generate some EEG measurements we might expect for a child in Bangladesh at 3 years of age, given all the other EEG measurements we have observed so far.”

  4. AI-assisted inferential statistics: “which genetic markers, if any, predict increased risk of suicide given a PTSD diagnosis, and how confident can we be in the size of that increase, given uncertainty due to statistical sampling and the large number of possible alternative explanations?”
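
To make the idea behind items 1 and 2 concrete (ranking records by how improbable they are under a model of the data), here is a minimal sketch in plain Python with the standard sqlite3 module. It is not BQL and does not use bayeslite or CrossCat: a single normal distribution fitted to one column stands in for the model, and the table name, column names, helper function, and readings are all invented for illustration.

```python
import math
import sqlite3
import statistics


def rank_by_surprise(conn, table, column):
    """Return (rowid, value, log_density) tuples, least probable first.

    Stand-in model (an assumption, not CrossCat): one normal distribution
    fitted to the column. `table` and `column` are trusted identifiers.
    """
    rows = conn.execute(
        f"SELECT rowid, {column} FROM {table} WHERE {column} IS NOT NULL"
    ).fetchall()
    values = [value for _, value in rows]
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)

    def log_density(x):
        # Log of the normal probability density at x.
        z = (x - mu) / sigma
        return -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))

    scored = [(rowid, value, log_density(value)) for rowid, value in rows]
    # Lowest log density first: the most unexpected values lead the list.
    return sorted(scored, key=lambda item: item[2])


# Hypothetical data: six made-up dose readings, one far from the rest.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, dose REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 0.11), ("b", 0.12), ("c", 0.10),
     ("d", 0.13), ("e", 0.95), ("f", 0.12)],
)

ranked = rank_by_surprise(conn, "readings", "dose")
for rowid, value, score in ranked[:3]:
    print(rowid, value, round(score, 3))
```

The reading from sensor "e" ranks first because it has the lowest density under the fitted distribution. A one-column normal fit cannot tell an error from a legitimate anomaly and ignores relationships between columns. Those are the judgment calls the text above attributes to CrossCat.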


The source code for bayeslite, a Python/sqlite3 implementation of BayesDB, is available on GitHub. We recommend installing BayesDB using the Open Probabilistic Programming Stack.


Probabilistic search for structured data via probabilistic programming and nonparametric Bayes. Saad, F.; Casarsa, L.; and Mansinghka, V. arXiv preprint, arXiv:1704.01087. 2017.

Detecting dependencies in sparse, multivariate databases using probabilistic programming and non-parametric Bayes. Saad, F.; and Mansinghka, V. In Artificial Intelligence and Statistics (AISTATS). 2017.

A Probabilistic Programming Approach To Probabilistic Data Analysis. Saad, F.; and Mansinghka, V. In Advances in Neural Information Processing Systems (NIPS). 2016.

Probabilistic data analysis with probabilistic programming. Saad, F.; and Mansinghka, V. arXiv preprint, arXiv:1608.05347. 2016.

BayesDB: A probabilistic programming system for querying the probable implications of data. Mansinghka, V.; Tibbetts, R.; Baxter, J.; Shafto, P.; and Eaves, B. arXiv preprint, arXiv:1512.05006. 2015.