BayesDB

Query the probable implications of your tabular data as easily as an SQL database lets you query the data itself.

Download: Developer Alpha v0.2.0-pre [Docker][Github][Documentation]

**WARNING:** BayesDB is rapidly-evolving, **alpha-quality** research software. Some key ideas behind its design and implementation have yet to be published, documentation is currently sparse, our release engineering is rough, and there may well be front-end and back-end bugs.

BayesDB, a Bayesian database table, lets users query the probable implications of their tabular data as easily as an SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with no statistics training can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries.

BayesDB is suitable for analyzing complex, heterogeneous data tables with up to tens of thousands of rows and hundreds of variables. No preprocessing or parameter adjustment is required, though experts can override BayesDB's default assumptions when appropriate.

BayesDB assumes that each row of your table is a sample from some fixed population or generative process, and estimates the joint distribution over the columns. BQL then allows you to draw Bayesian inferences about individuals and about the overall population or process. The estimates are currently provided by CrossCat, a new nonparametric Bayesian method for analyzing high-dimensional data tables.

BayesDB is alpha-quality research software, made available to facilitate testing and collaboration, not yet ready for production use. We are actively looking for feedback and contributions. Please sign up here.

`INFER salary FROM mytable WHERE age > 30 WITH CONFIDENCE 0.95;`

Fill in missing data with the `INFER`

command. Unlike a traditional regression model, where you need to separately train a supervised model for each column you're interested in predicting, `INFER`

statements are flexible and work with any set of columns to predict.

`SIMULATE salary FROM mytable WHERE age > 30;`

Easily simulate new probable observations based on CrossCat's estimate of the joint density of the data.

`ESTIMATE PAIRWISE DEPENDENCE PROBABILITIES FROM mytable;`

With just one command, estimate any pairwise function of columns, including the probability that the two columns are statistically dependent, the mutual information between columns, and their correlation.

BayesDB is aimed at:

Users without statistics training but who have tabular data and questions they'd like to answer.

Users with statistics expertise who either have a problem that doesn't fit the standard supervised templates (e.g. the data is too sparse) or who want to analyze a tabular dataset but don't have the time or inclination for custom modeling.

The main goal is to broaden access to analytics and increase productivity. We also hope it will increase the reliability of the empirical reasoning performed by non-experts, who currently lack the ability to build good models or critique the simplistic models that they do have access to.

Of course, when decisions can have costly consequences, there is no substitute for a critical assessment of the underlying rationale. In these situations, please consult a good statistician (be they "Bayesian" or "frequentist"), and if possible invest the time to critically assess her reasoning and methodology. Although BayesDB relies on unusually flexible assumptions, reports its uncertainty and is designed to err in the direction of underconfidence, all assumptions can be misleading in some circumstances.

An 'expert' mode (enabling e.g. custom models via call-outs to, say, custom Python statsmodels-based procs, or even probabilistic programs) would be interesting, but is not our focus right now.

Two ideas:

The BQL language. BQL is designed to encapsulate exploratory analysis and predictive modeling workflows in a way that is (essentially) independent of the underlying probabilistic modeling and inference approach. The technical idea is to turn analytic queries into approximate Bayesian inferences about joint distribution induced by the underlying data generating process, and decouple that from the problem of building a model of that process. We see this as an inferential analogue of the physical data independence afforded by traditional DBMSs; our upcoming paper for more discussion.

CrossCat. CrossCat can be thought of as a flexible, generic meta-model for probabilistic models of data tables that relaxes some of the unrealistic assumptions of typical Bayesian learning techniques. Approximate Bayesian inference in CrossCat tackles some of the model selection and parameter estimation issues that experts normally address by custom exploration, feature selection, model building and validation. Also, it produces estimates of the joint distribution of the data that is easy to conditionally query (unlike a Bayesian network) and that also has some useful latent variables. This makes several BQL commands natural to implement, and supports a "general purpose" BayesDB implementation.

Just as traditional SQL-based DBMSs are specialized for different data shapes and workloads, usually for reasons of performance, we suspect BQL-based DBMSs could be, also, for reasons of performance and predictive accuracy.

**WARNING:** BayesDB is rapidly-evolving, **alpha-quality** research software. Some key ideas behind its design and implementation have yet to be published, documentation is currently sparse, our release engineering is rough, and there may well be front-end and back-end bugs.

We are humbled and excited by the interest in the project, and are working hard to improve BayesDB's maturity

Please send us your application ideas, bug reports, and feature requests. Also, if you think you might be interested in alpha testing, collaborating, or contributing, please write us! We'd love to hear from you at bayesdb [AT] mit.edu or by filling out the BayesDB Tester Signup Form. .

Versions of BayesDB (and CrossCat) have been tried on datasets including surveys, performance metrics for hospitals, branches of nonprofits, econometric tables, feature vectors extracted from text/images, voting records, and gene expression microarrays.

So far we have mostly focused on exploratory analysis and predictive modeling, but there are other interesting possibilities. In principle, it should also be usable for confirmatory analyses, by framing hypothesis tests as queries over the posterior over full joint distributions. We are interested in collaborating with statisticians to define these, and extend BQL to support them. It also may be possible to support online inferential transactions; if this sounds interesting to you, talk to us.

There are some regimes (e.g. N < 100, D ~ 1000, with sparse observations—or N ~ 1000, D ~ 10-100, with lots of highly nonlinear, stochastic relationships) where the CrossCat backend might also offer an advantage over many custom models.

We are very glad you asked; this is a crucial and subtle question. As much as we like probabilistic modeling, Bayesian inference, Bayesian nonparametrics, careful hyperparameter inference and Markov chain Monte Carlo, we know they are not a panacea.

BayesDB and CrossCat are alpha-quality software; there are likely to be bugs that impact inference quality. We can't accept responsibility for any negative consequences arising from its use or misuse.

Modulo bugs, we feel that BayesDB is reliable enough to trust as a first pass on typical data tables. This is because CrossCat has some unusual properties that we think make it a reasonable general-purpose model:

CrossCat contains various simpler, standard black box models as hypotheses in its hypothesis space—including Dirichlet processs mixtures and a form of naive Bayes. Modulo bugs, this means that with enough inference, CrossCat shouldn't perform essentially worse than these simpler methods, and is likely to perform better.

CrossCat includes many hypotheses that correspond to versions of the claim "I can find no signal in this part of the dataset", and is quite happy to report broad predictive distributions when it is uncertain. Although there might actually be signal that another method could have found, we think this is a good direction to err in. Some versions have been tested with the kinds of small N, high-D, extremely noisy datasets that confuse many standard methods and performed surprisingly well, though we have more empirical work to do there.

There are theoretical results that suggest CrossCat is an asymptotically consistent estimator of the joint distribution. This has actually been proved for Dirichlet process mixture models with various component models, and we do not think the increased flexibility of CrossCat interferes (as N → infinity, though there is a subtlety if D → infinity). This roughly means that for a surprisingly broad class of distributions, if you run inference long enough and have enough data, CrossCat will infer a model that predicts equivalently with the 'true' model.

Ultimately, of course, there is no substitute for closely scrutinizing our modeling assumptions and implementation.

Our models are memory efficient, our inference algorithms scale linearly in runtime, we have yet to do serious performance engineering, there is a lot of unexploited parallelism, and we haven't yet incorporated subsampling or more advanced inference techniques. Thus we do not think there are intrinsic barriers to greater scale, even remaining with the CrossCat backend.

A commercial variant of CrossCat, built by the startups Prior Knowledge and Navia Systems (ultimately acquired by Salesforce), addressed many of these issues. If you asked some of us to bet, we would say that a billion rows and ~200 columns should be possible without algorithmic or statistical breakthroughs, but rather careful engineering.

That said:

We have not yet shown that it really works in practice beyond the ~1M-cell tables we've been working on—or that the inferences are reliable enough at that scale, especially with large numbers of columns. At billions or trillions of cells, a different (and maybe more heterogeneous) inference backend might be more appropriate.

The front-end will need to be completely rebuilt for those scales. At some point hosting in something like Spark would be more appropriate.

Right now, BayesDB uses CrossCat to estimate the joint distribution over the columns, and performs posterior averaging over these estimates to answer all BQL queries. CrossCat is more flexible than naive Bayes: it determines which variables are probably independent from the data, as opposed to assuming they all are independent a priori, and estimates a flexible nonparametric Bayesian density model for each group of dependent variables. More formally, it is a Dirichlet process mixture model (over the columns) of Dirichlet process mixture models over the rows. One way to think of it is a nonparametric Bayesian take on the problem that Bayesian network structure learning tries to solve. If you are curious about the details, please read our papers and/or contact us.

CrossCat gives us reasonable models for a broad class of data tables. In the future, we'll want to incorporate other modeling techniques (perhaps chosen via approximate Bayesian model selection) behind the scenes. Two examples we're especially curious about:

For the standard regression/classification setting—

`INFER`

commands with a single target column, large N, and where all other columns for all rows are fully observed—we may be able to improve accuracy by integrating discriminative modeling techniques.BayesDB backends based on deep learning and/or density forests might be interesting for very large datasets, especially ones with unstructured columns. Contact us if you might be interested in collaborating on this.

Nearly deterministic relationships between variables are easy for CrossCat to detect, but awkwardly modeled: a large number of low-variance clusters are introduced, yielding a kind of adaptively-tuned nearest neighbor. We're not yet sure what this means for moderately stochastic linear relationships.

In principle, CrossCat can estimate any joint distribution given enough data, even in the presence of multiple conflicting statistical signals in different groups of columns. In practice, this might require unrealistic amounts of data and computation. We'd like to better characterize the regimes where CrossCat is getting highly accurate (and properly uncertain) estimates, and regimes where another backend would be more appropriate.

We are interested in exploring heterogeneous backends, model ensembles, and richer formulations of model selection for reducing these limitations. We would also like to identify other failure modes, especially if they could reveal a gap between theory and practice.

Right now we only have built-in types for categorical and generic numerical data, modeled by the Dirichlet-Multinomial and Normal-Inverse-Gamma component models, respectively.

A wide range of datatypes are possible, both by remaining within the exponential family—for e.g. count data, ordinal data, positive real data, and more—and by going beyond it, at the cost of slower inference. Unstructured data fields, such as text, and structured compound fields, such as time series in individual cells, will require a more complex approach using more traditional ML. They are on our roadmap but not a high priority.

BayesDB had a nice ring to it—better than CatDB, CrossDB, DBCat, etc. (names based on CrossCat).

The idea behind BQL is that we can formalize many data analysis tasks using Bayesian queries against a posterior distribution over a hypothesis space that itself consists of models of the full joint distribution over the variables in the table. We thought this mathematical notion might let us define a coherent interface to the implications of data that retained the clarity, simplicity and flexibility of SQL. In some sense, it is our analogue of the relational algebra.

Our current BayesDB implementation takes this idea literally. A single probabilistic model, called CrossCat, is estimated from the data, where (ignoring any bugs) each sample produced by `ANALYZE`

is independently drawn from a distribution between the CrossCat prior and the true posterior given the data in the table. The more runtime you give `ANALYZE`

, the closer to the true posterior BayesDB becomes. The more samples you get, the more accurately you can resolve posterior uncertainty.

Some Bayesian priors are about precisely expressing ignorance. This is what CrossCat is doing inside BayesDB: encoding a very broad prior over generative processes that admits possibilities ranging from near determinism to unpredictability.

That said, BQL does allow users to encode some of their prior knowledge, via schema design. Currently this means deciding what the columns and rows in the table are, and what the datatypes are for each column. Going forward, we plan to support additional user-level configurability.

If it survives, BQL will likely evolve towards becoming a general-purpose probabilistic programming language, perhaps even one like Venture or Church. From this perspective, the current CrossCat backend corresponds to a simple kind of Bayesian probabilistic program synthesis, where the class of probabilistic programs it is learning is extremely simple.

CrossCat came out of work by Mansinghka and Shafto in Josh Tenenbaum's computational cognitive science group in a series of academic publications starting in 2006. A fast commercial implementation was built by Navia Systems/Prior Knowledge, ultimately acquired by Salesforce. CrossCat was used to power a prediction API called Veritable (in the spirit of Google Predict, with a nonparametric Bayesian twist) aimed at making it easier for developers to build smarter applications.

BayesDB, the Bayesian Query Language, and our current open-source CrossCat implementation, were funded starting in 2012 by the DARPA XDATA program and Google's "Rethinking AI" project. The audience is analysts with SQL expertise and workflows involving exploratory analysis, confirmatory analysis and predictive modeling. The research question is whether it is possible to do for data science what the relational model did for data management. Accordingly, we expect BQL to evolve in the direction of a general-purpose probabilistic programming language like Venture, with our current CrossCat backend one of many but a reasonable place to start, analogous to a default indexing scheme in a SQL database.

BayesDB and its sister project, CrossCat, are being developed by Jay Baxter, Dan Lovell, and Vikash Mansinghka from the MIT Probabilistic Computing Project and by Pat Shafto and Baxter Eaves from the CoCoSci Lab at University of Louisville.

Our work is generously supported by research a contract with DARPA (under the XDATA program) and the Google "Rethinking AI" project. The views expressed on this website and in our research are our own, and do not necessarily reflect the views of our government or corporate sponsors.

If you are interested in alpha or beta testing BayesDB, please get in touch with us through this form: BayesDB Tester Signup Form.

To see what people are saying about BayesDB, follow us on Twitter: Follow @jaynbaxter Follow @vmansinghka Follow @patrickshafto Follow @BaxterEaves

If you have any comments or questions, please feel free to email us at bayesdb [AT] mit.edu.