If you would like to analyze your own data with BayesDB, please contact bayesdb@mit.edu to participate in our research project.
CREATE GENERATOR <name> [IF NOT EXISTS] FOR <table> USING <metamodel> (<schema>)
Create a generative model named name for the table named table in the language of metamodel. Schema describes the generative model in syntax that depends on the metamodel. Typically, it is a comma-separated list of clauses of the form
<column> <type>
requesting the column column to be modelled with the statistical type type, with some additional types of clauses. For example,
CREATE GENERATOR t_cc FOR t USING crosscat (
    SUBSAMPLE(1000),      -- Subsample down to 1000 rows;
    GUESS(*),             -- guess all column types, except
    name IGNORE,          -- ignore the name column, and
    angle CYCLIC          -- treat angle as CYCLIC.
)
DROP GENERATOR [IF EXISTS] <name>
Drop the generator named name and all its models.
ALTER GENERATOR <name> <alterations>
Alter the specified properties of the generator named name. Alterations is a comma-separated list of alterations. The following alterations are supported:
RENAME TO <newname>
Change the generator’s name to newname.
INITIALIZE <n> MODEL[S] [IF NOT EXISTS] FOR <name>
Perform metamodel-specific initialization of up to n models for the generator named name. n must be a literal integer. If the generator already had models, the ones it had are unchanged. Models are zero-indexed.
DROP MODELS [<modelset>] FROM <name>
Drop the specified models from the generator named name. Modelset is a comma-separated list of model numbers or hyphenated model number ranges, inclusive on both bounds. If modelset is omitted, all models are dropped from the generator.
Example:
DROP MODELS 1-3 FROM t_cc
Equivalent:
DROP MODEL 1 FROM t_cc; DROP MODEL 2 FROM t_cc; DROP MODEL 3 FROM t_cc
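To make the modelset notation concrete, here is a minimal sketch of how a string such as 1-3,7-9 expands into individual model numbers, with both range bounds inclusive. The function name parse_modelset is hypothetical and is not part of bayeslite; the real parser may differ in how it validates input.

```python
def parse_modelset(modelset):
    """Expand a modelset string such as '1-3,7-9' into a sorted list of ints."""
    modelnos = set()
    for part in modelset.split(','):
        part = part.strip()
        if '-' in part:
            # Hyphenated range, inclusive on both bounds.
            lo, hi = part.split('-')
            lo, hi = int(lo), int(hi)
            if lo > hi:
                raise ValueError('empty range: %r' % (part,))
            modelnos.update(range(lo, hi + 1))
        else:
            # Single model number.
            modelnos.add(int(part))
    return sorted(modelnos)


print(parse_modelset('1-3'))      # [1, 2, 3]
print(parse_modelset('1-3,7-9'))  # [1, 2, 3, 7, 8, 9]
```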
ANALYZE <name> [MODEL[S] <modelset>] [FOR <duration>] [CHECKPOINT <duration>] WAIT
Perform metamodel-specific analysis of the specified models of the generator name. Modelset is a comma-separated list of model numbers or hyphenated model number ranges. Duration is either <n> SECOND[S], <n> MINUTE[S], or <n> ITERATION[S].
The FOR duration specifies how long to perform analysis. The CHECKPOINT duration specifies how often to commit the intermediate results of analysis to the database on disk.
Examples:
ANALYZE t_cc FOR 10 MINUTES CHECKPOINT 30 SECONDS WAIT
ANALYZE t_cc MODELS 1-3,7-9 FOR 10 ITERATIONS CHECKPOINT 1 ITERATION WAIT
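The relationship between the FOR and CHECKPOINT durations can be sketched as a loop that does a fixed budget of work and commits at regular intervals. This is only an illustration for iteration-based durations. The names analyze, step, and commit are hypothetical, and nothing here reflects how a particular metamodel actually schedules its work or writes to disk.

```python
def analyze(step, commit, iterations, checkpoint):
    """Run `step` `iterations` times, calling `commit` every `checkpoint` steps."""
    done = 0
    since_commit = 0
    while done < iterations:
        step()
        done += 1
        since_commit += 1
        if since_commit == checkpoint:
            commit()
            since_commit = 0
    if since_commit:
        commit()  # Commit any trailing work after the last checkpoint.
    return done


# Analogous to: FOR 5 ITERATIONS CHECKPOINT 2 ITERATIONS
log = []
analyze(step=lambda: log.append('step'),
        commit=lambda: log.append('commit'),
        iterations=5, checkpoint=2)
print(log)
```

A shorter checkpoint interval loses less work if analysis is interrupted, at the cost of committing more often.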
Metamodel interface.
To be used to model data in a bayeslite.BayesDB handle, a metamodel must first be registered with bayesdb_register_metamodel().
The Crosscat metamodel is registered by default, but we can suppress that for illustration:
import bayeslite
import crosscat.LocalEngine
from bayeslite.metamodels.crosscat import CrosscatMetamodel
bdb = bayeslite.bayesdb_open(pathname='foo.bdb', builtin_metamodels=False)
cc = crosscat.LocalEngine.LocalEngine(seed=0)
bayeslite.bayesdb_register_metamodel(bdb, CrosscatMetamodel(cc))
Then you can model a table with Crosscat and query the probable implications of the data in the table:
bdb.execute('create generator t_cc for t using crosscat(guess(*))')
bdb.execute('initialize 10 models for t_cc')
bdb.execute('analyze t_cc for 10 iterations wait')
for x in bdb.execute('estimate pairwise dependence probability from t_cc'):
    print(x)
BayesDB metamodel interface.
Subclasses of IBayesDBMetamodel implement the functionality needed by probabilistic BQL queries to sample from and inquire about the posterior distribution of a generative model conditioned on data in a table. Instances of subclasses of IBayesDBMetamodel contain any in-memory state associated with the metamodel in the database.
Add colno from the population as a variable in the metamodel.
Used by the MML:
ALTER POPULATION <population> ADD VARIABLE <variable> <stattype>
Analyze the specified model numbers of a generator.
If none are specified, analyze all of them.
Compute DEPENDENCE PROBABILITY OF <col0> WITH <col1>.
Compute MUTUAL INFORMATION OF (<cols0>) WITH (<cols1>).
Create a generator for a table with the given schema.
Called when executing CREATE GENERATOR.
Must parse schema to build the generator.
The generator id and column numbers may be used to create metamodel-specific records in the database for the generator with foreign keys referring to the bayesdb_generator and bayesdb_generator_column tables.
schema is a list of schema items corresponding to the comma-separated ‘columns’ from a BQL CREATE GENERATOR command. Each schema item is a list of strings or lists of schema items, corresponding to whitespace-separated tokens and parenthesized lists. Note that within parenthesized lists, commas are not excluded.
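The nested-list shape described above can be illustrated with a small stand-alone parser. This is a sketch under assumptions, not bayeslite's own parser: parse_schema is a hypothetical name, every token is kept as a string, and comments are not handled. Top-level commas separate schema items, while commas inside parentheses are kept as ',' tokens.

```python
import re


def parse_schema(text):
    """Parse schema text into a list of schema items (nested lists of strings)."""
    tokens = re.findall(r'[(),]|[^\s(),]+', text)
    items = []    # completed top-level schema items
    stack = [[]]  # stack[0] is the current item; deeper entries are open parens
    for tok in tokens:
        if tok == '(':
            stack.append([])
        elif tok == ')':
            if len(stack) == 1:
                raise ValueError('unbalanced )')
            inner = stack.pop()
            stack[-1].append(inner)
        elif tok == ',' and len(stack) == 1:
            # A top-level comma ends the current schema item.
            items.append(stack[0])
            stack = [[]]
        else:
            # Ordinary token, or a comma inside parentheses.
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError('unbalanced (')
    if stack[0]:
        items.append(stack[0])
    return items


schema = parse_schema('SUBSAMPLE(1000), GUESS(*), name IGNORE, angle CYCLIC')
print(schema)
# [['SUBSAMPLE', ['1000']], ['GUESS', ['*']], ['name', 'IGNORE'], ['angle', 'CYCLIC']]
```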
Drop any metamodel-specific records for a generator.
Called when executing DROP GENERATOR.
Drop the specified model numbers of a generator.
If none are specified, drop all models.
Initialize the specified model numbers for a generator.
Evaluate the joint probability of targets subject to constraints.
Returns the probability density of the targets (in log domain).
rowid is an integer.
targets is a list of (colno, value) pairs.
constraints is a list of (colno, value) pairs.
modelno is a model number or None, meaning all models.
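Because the density is returned in the log domain, combining values from several models needs some care to avoid underflow. The sketch below shows one simple way to do that: an equal-weight average in probability space, computed stably. Whether a given metamodel averages its models this way is an assumption made for illustration, and logmeanexp is a hypothetical helper rather than a bayeslite function.

```python
import math


def logmeanexp(logps):
    """Return log(mean(exp(lp) for lp in logps)) without underflow."""
    m = max(logps)
    if m == float('-inf'):
        return m  # Every model assigns zero density.
    # Factor out the largest term so each exponent is at most zero.
    return m + math.log(sum(math.exp(lp - m) for lp in logps) / len(logps))


# Two hypothetical models assigning densities 0.2 and 0.4 to the same targets.
per_model = [math.log(0.2), math.log(0.4)]
print(math.exp(logmeanexp(per_model)))  # approximately 0.3
```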
Return the name of the metamodel as a str.
Predict a value for a column, if confidence is high enough.
Predict a value for a column and return confidence.
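The difference between the two prediction operations can be sketched for a nominal column, assuming a set of simulated values is already available. Here confidence is taken to be the relative frequency of the most common value. That definition, and the helper names mode_with_confidence and predict_if_confident, are assumptions made for illustration. How confidence is actually computed is specific to each metamodel.

```python
from collections import Counter


def mode_with_confidence(samples):
    """Return (value, confidence): the most common sample and its frequency."""
    value, count = Counter(samples).most_common(1)[0]
    return value, count / float(len(samples))


def predict_if_confident(samples, threshold):
    """Return the predicted value, or None if confidence is below threshold."""
    value, confidence = mode_with_confidence(samples)
    return value if confidence >= threshold else None


samples = ['a', 'a', 'b', 'a']
print(mode_with_confidence(samples))       # ('a', 0.75)
print(predict_if_confident(samples, 0.5))  # a
print(predict_if_confident(samples, 0.9))  # None
```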
Compute predictive relevance, also known as relevance probability.
rowid_target is an integer.
rowid_query is a list of integers.
Install any state needed for the metamodel in bdb.
Called by bayeslite.bayesdb_register_metamodel().
Normally this will create SQL tables if necessary.
Note that a table column has been renamed.
Not currently used. To be used in the future when executing:
ALTER TABLE <table> RENAME COLUMN <oldname> TO <newname>
Compute SIMILARITY TO <target_row> for given rowid.
Switch between multiprocessing and single processing.
The boolean variable switch toggles between single (False) and multi (True) processing, if the choice is available, and otherwise ignores the request.
Simulate targets from a generator, subject to constraints.
Returns a list of lists of values for the specified targets.
rowid is an integer.
modelno may be None, meaning “all models”.
targets is a list of column numbers (colno).
constraints is a list of (colno, value) pairs.
num_samples is the number of results to return.
accuracy is a generic parameter (usually an int) that specifies the desired accuracy, compute time, etc., for metamodels whose simulations are only approximately distributed according to the true target distribution.
The results are samples from the distribution on targets, conditionally independent given (the latent state of the metamodel and) the constraints.
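The shape of the inputs and the result can be illustrated with a toy joint distribution over two nominal columns. Everything in this sketch is hypothetical: the JOINT table, the simulate function, and the explicit rng argument are stand-ins for a real metamodel, which would sample from its own posterior instead of a fixed table.

```python
import random

# Hypothetical joint distribution over two columns (colno 0 and colno 1).
JOINT = {
    ('sunny', 'dry'): 0.5,
    ('sunny', 'wet'): 0.1,
    ('rainy', 'dry'): 0.1,
    ('rainy', 'wet'): 0.3,
}


def simulate(targets, constraints, num_samples, rng):
    """Draw num_samples lists of target values, given (colno, value) constraints."""
    # Keep only the outcomes consistent with every constraint.
    support = [(row, p) for row, p in sorted(JOINT.items())
               if all(row[colno] == value for colno, value in constraints)]
    if not support:
        raise ValueError('constraints have probability zero')
    total = sum(p for _, p in support)
    samples = []
    for _ in range(num_samples):
        # Inverse-CDF draw from the renormalized conditional distribution.
        u = rng.random() * total
        acc = 0.0
        chosen = support[-1][0]
        for row, p in support:
            acc += p
            if u < acc:
                chosen = row
                break
        # One value per requested target column, in the order requested.
        samples.append([chosen[colno] for colno in targets])
    return samples


rng = random.Random(0)
samples = simulate(targets=[1], constraints=[(0, 'rainy')], num_samples=5, rng=rng)
print(samples)  # five one-element lists, each ['dry'] or ['wet']
```

Each draw uses a fresh random number, so the samples are independent of one another once the constraints are fixed.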
Deregister metamodel, which must have been registered in bdb.
Register all builtin metamodels in bdb.
Register metamodel in bdb, creating any necessary tables.
metamodel must not already be registered in any BayesDB, nor may any metamodel with the same name already be registered in bdb.
Crosscat is a fully Bayesian nonparametric method for analyzing heterogeneous, high-dimensional data, described at http://probcomp.csail.mit.edu/crosscat/.
This module implements the bayeslite.IBayesDBMetamodel interface for Crosscat.
Crosscat metamodel for BayesDB.
Parameters: crosscat, the Crosscat engine.
The metamodel is named crosscat in BQL:
CREATE GENERATOR t_cc FOR t USING crosscat(...)
Internally, the Crosscat metamodel adds SQL tables to the database with names that begin with bayesdb_crosscat_.
Heuristic guessing of statistical types based on data.
The heuristics implemented here are ad-hoc, and do not implement any sort of Bayesian model selection. They are based on crude attempts to parse data as numbers, and on fixed parameters for distinguishing nominal and numerical data. No columns are ever guessed to be cyclic.
Heuristically guess a population schema for table.
Based on the data in table, create a population named population.
In addition to statistical types, the overrides may specify key or ignore, in which case those columns will not be modelled at all.
Heuristically guess statistical types for the data in rows.
Return a list of (statistical type, reason) corresponding to the columns named in the list column_names.
In addition to statistical types, the overrides may specify key or ignore.
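The kind of heuristic described in this module can be sketched as follows: try to parse each value as a number, fall back to nominal when that fails, and use a fixed threshold on the number of distinct values to separate nominal from numerical data. The function names, the threshold nominal_max_distinct, the dictionary form of overrides, and the reason strings are all hypothetical. The actual parameters and rules in bayeslite may differ. As in the description above, nothing is ever guessed to be cyclic.

```python
def guess_stattype(values, nominal_max_distinct):
    """Return (stattype, reason) for one column of raw values."""
    present = [v for v in values if v is not None and v != '']
    if not present:
        return ('nominal', 'no values to inspect')
    try:
        numbers = [float(v) for v in present]
    except (TypeError, ValueError):
        return ('nominal', 'some values do not parse as numbers')
    distinct = len(set(numbers))
    if distinct <= nominal_max_distinct and all(x.is_integer() for x in numbers):
        return ('nominal', 'only %d distinct integer values' % distinct)
    return ('numerical', 'values parse as numbers')


def sketch_guess_stattypes(column_names, rows, overrides=None,
                           nominal_max_distinct=20):
    """Return a list of (stattype, reason), one per name in column_names."""
    overrides = overrides or {}
    guesses = []
    for i, name in enumerate(column_names):
        if name in overrides:
            # An override wins over the heuristic, e.g. 'key' or 'ignore'.
            guesses.append((overrides[name], 'overridden'))
        else:
            column = [row[i] for row in rows]
            guesses.append(guess_stattype(column, nominal_max_distinct))
    return guesses


column_names = ['name', 'age', 'flag', 'height']
rows = [
    ('alice', '34', '0', '1.62'),
    ('bob', '27', '1', '1.80'),
    ('carol', '51', '0', '1.75'),
]
guesses = sketch_guess_stattypes(column_names, rows,
                                 overrides={'name': 'ignore'},
                                 nominal_max_distinct=2)
for name, (stattype, reason) in zip(column_names, guesses):
    print('%s: %s (%s)' % (name, stattype, reason))
```

The threshold is set to 2 in this example only because the table has three rows. With so little data, a larger threshold would classify the age column as nominal.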