# BQL: Bayesian Query Language¶

BQL is a probabilistic extension to SQL that supports

- standard deterministic SQL queries,
- estimating probabilities and strengths of relationships between rows and columns,
- inferring missing values within confidence intervals, and
- simulating data from generative models.

BQL does not currently include any of the SQL Data Modification
Language (DML) – `INSERT`

, `UPDATE`

, and other commands to modify
the contents of tables – and includes only limited subset of the SQL
Data Definition Language (DDL) – `CREATE TABLE`

and commands to
modify the database schema.

A BQL phrase is either a command or a query. In contexts admitting multiple BQL phrases, e.g. at the bayeslite shell, each phrase must be terminated by a semicolon before the next one begins.

Expressions in BQL are like SQL, and may involve standard arithmetic,
SQL functions such as `IFNULL`

. Expressions in BQL may additionally
involve model functions such as `PREDICTIVE PROBABILITY OF <column>`

in appropriate contexts.

In the following syntax description, square brackets denote optional terms. For example, the pattern

`DEPENDENCE PROBABILITY [[OF <column1>] WITH <column2>]`

allows

`DEPENDENCE PROBABILITY`

`DEPENDENCE PROBABILITY WITH quagga`

`DEPENDENCE PROBABILITY OF eland WITH quagga`

but not `DEPENDENCE PROBABILITY OF eland`

.

## BQL Commands¶

BQL commands change the state of the database.

### Transactions¶

Transactions are groups of changes to a database that happen all at once or not at all. Transactions do not nest.

FUTURE: BQL will additionally support savepoints (Github issue #36), which are like transactions but may be named and nested.

`BEGIN`

Begin a transaction. Subsequent commands take effect within the transaction, but will not be made permanent until`COMMIT`

, and may be undone with`ROLLBACK`

.

`COMMIT`

End a transaction, and commit to all changes made since the last`BEGIN`

.

`ROLLBACK`

End a transaction, and discard all changes made since the last`BEGIN`

.

### Data Definition Language¶

The BQL DDL is currently limited to creating tables from the results of queries, and dropping and renaming tables.

The complete SQL DDL supported by sqlite3 is not supported by BQL. Note that one can always fall back to executing SQL instead of BQL in Bayeslite.

`CREATE [TEMP|TEMPORARY] TABLE [IF NOT EXISTS] <name> FROM '<pathname>'`

Create a table namednamefrom the csv file atpathname. Note thatpathnameis a string, and should be surrounded by single quotes.

`CREATE [TEMP|TEMPORARY] TABLE [IF NOT EXISTS] <name> AS <query>`

Create a table namednameto hold the results of the queryquery.

`DROP TABLE [IF EXISTS] <name>`

Drop the table

nameand all its contents.May fail if there are foreign key constraints that refer to this table.

`ALTER TABLE <name> <alterations>`

Alter the specified properties of the table

name. Thealterationsare a comma-separated list of alterations. The following alterations are supported:

`RENAME TO <newname>`

Change the table’s name to

newname. Foreign key constraints are updated; triggers and views are not, and must be dropped and recreated separately, due to limitations in sqlite3.If table

namehas an implicit population, then both the table and its implicit population will be renamed tonewname.FUTURE: Renaming columns (Github issue #35).

### Metamodeling Language (MML)¶

#### Populations¶

A BQL *population* is a named database object which contains a collection of
*variables* and their *statistical data types*. The variables of a population
correspond to (a subset of the) columns in a given SQL table *t*, known as the
‘’base table’’ of the population. Analogously to each column in a SQL table
having a data type (such as `INT`

, `FLOAT`

, or `VARCHAR`

) which determines
what kind of data can be stored for that column, every variable in a population
has a statistical data type (such as `NUMERICAL`

, `COUNT`

, `MAGNITUDE`

, or
`NOMINAL`

) which determines which probabilistic models are applicable to that
variable.

**Note**: While the terms ‘column’ and ‘variable’ are often used exchangeably,
formally a ‘column’ belongs to a SQL table, whereas a ‘variable’ belongs to a
BQL population.

`CREATE POPULATION [IF NOT EXISTS] [<pop>] FOR <table> WITH SCHEMA (<schema>)`

Create a population named

popwith basetableand statistical data types given byschema.If the population name

popis not specified, then an “implicit” population will be created for the table, whose name is the same as the table name. Note that a table with an implicit population cannot have multiple populations.The

schemais defined using any combination of the following statements, separated by semicolons:

`GUESS STATTYPE(S) OF (<column(s)>)`

Guess the statistical type for the given comma-separated list ofcolumn(s), using data-dependent heuristics. Use (*) to indicate all columns in the table.

`SET STATTYPE(S) OF <column(s)> TO <stattype>`

Set the statistical data type ofcolumn(s)tostattype.

`IGNORE <column(s)>`

Ignorecolumn(s); no variable in the population will be created for these columns in the base table

`DROP POPULATION [IF EXISTS] <pop>`

Drop populationpopand all its contents. Will fail if there are still generators associated with this population.

`ALTER POPULATION <pop>`

Alter the specified properties of

pop. The following alterations are supported:

`RENAME TO <newname>`

Change the population’s name to

newname.If

popis an implicit population for a base tablepop, this command will fail. Instead, use ALTER TABLE <pop> RENAME TO <newname>, which will result in renaming both the base table and its implicit population tonewname.If population

pophas an implicit generator, then both the population and its implicit generator will be renamed tonewname.

`ADD VARIABLE <varname> [<stattype>]`

Add the given variable to the population, optionally specifying its statistical data type. If unspecified, the statistical type will be heuristically guessed.

Note that

varnamemust correspond to an existing column in the base table of the population; it is either a column that was specified as IGNORE when creating the population, or a column that was added later using e.g. the SQL command ALTER TABLE <t> ADD COLUMN.

`SET STATTYPE OF <variable(s)> TO <stattype>`

Change the statistical type of the givenvariable(s)tostattype.

`GUESS SCHEMA FOR <table>`

Guess a population schema fortable. The schema maps each column intableto its guessed statistical type, and gives the heuristic reason for the guess. Columns intablewill be guessed to be`NOMINAL`

,`NUMERICAL`

or`IGNORE`

. The query yields a table with three columns:`name`

,`stattype`

, and`reason`

.

#### Generators¶

A BQL *generator* is a generative probabilistic model which describes the joint
distribution of all the variables in a given base population.

`CREATE GENERATOR [IF NOT EXISTS] [<gen>] FOR <population> [USING <backend>] (<customization>)`

Create generator

genforpopulation, optionally specifying whichbackendto use.If the generator name

genis not specified, then an “implicit” generator will be created for the population, whose name is the same as the population name. Note that a population with an implicit generator cannot have multiple generators.The default backend is

`cgpm`

, which uses CrossCategorization as the default generative model. This backend supports the followingcustomizationstatements for overriding parts of the default model:

`OVERRIDE GENERATIVE MODEL FOR <variable(s)> [GIVEN <variable(s)>] USING <predictor>`

Usepredictoras the generative model for the specified (conditional) distribution.

`SUBSAMPLE(<nrows>)`

Use a randomly chosen subsample ofnrowsrows from the base table of the population to use for fitting the generative model.

`INITIALIZE <n> MODELS [IF NOT EXISTS] FOR <g>`

Initialize an ensemble of

nmodels for the generatorg.Each model can be thought of as a different sample of all unknown parameters specified by the generative model of the generative model. For example, if the generator used is Bayesian factor analysis, then each model may correspond to a different posterior sample of the factor loading matrix.

Using

`IF NOT EXISTS`

will initialize all models in the range 0 tonum -1that do not already exist.

`ANALYZE <g> [MODELS (<indexes>)] FOR <duration> [CHECKPOINT <duration>] (<customization>)`

Perform analysis on models in generator

g. An optional subset of models can be specified by giving theirindexes; by default, analysis will be applied to all models. Thedurationcan take on values of`<n> SECOND(S)`

,`<n> MINUTE(S)`

, or`<n> ITERATION(S)`

. The`FOR`

duration specifies how long to perform analysis. The`CHECKPOINT`

duration specifies how often to commit the intermediate results of analysis to the database on disk.When the generator is created using the default

`cgpm`

backend, then the following semicolon-separatedcustomizationcommands are supported:

`OPTIMIZED`

Use a faster MCMC implementation for fitting CrossCat-modeled variables.

`QUIET`

Suppress progress bar.

`SKIP <variables>`

Analyze all variables in the population, except for the comma-separated list ofvariables.

`VARIABLES <variables>`

Analyze only the comma-separated list ofvariables.

`ROWS <rows>`

Analyze only the specified rows.

`SUBPROBLEMS (VARIABLE HYPERPARAMETERS, VARIABLE CLUSTERING, VARIABLE CLUSTERING CONCENTRATION, ROW CLUSTERING, ROW CLUSTERING CONCENTRATION)`

Specify an optional set of CrossCat subproblems to apply analysis to. By default, analysis will cycle randomly through all subproblems.

`DROP [[MODEL <num>] | [MODELS <num0>-<num1>] FROM] GENERATOR [IF EXISTS] <g>`

Drop the generatorgand all its contents. Optionally, drop only the model numberednum, or the models ranging fromnum0tonum1.

`ALTER GENERATOR <gen>`

Alter the specified properties of

gen. The following alterations are supported:

`RENAME TO <newname>`

Change the generator’s name to

newname.If

genis an implicit generator for a base population, this command will fail. Instead, use ALTER POPULATION <gen> RENAME TO <newname>, which will result in renaming both base population and its implicit generator tonewname.

## BQL Queries¶

`SELECT <columns>`

Standard SQL constant`SELECT`

: yield a single row by evaluating the specified columns.

`SELECT [DISTINCT|ALL] <columns> FROM <table> [WHERE <condition>] [GROUP BY <grouping>] [ORDER BY <ordering>] [LIMIT <limit>]`

Standard SQL

`SELECT`

. Model estimators are not allowed, except in subqueries of types that allow them.

`<columns>`

Comma-separated list of BQL expressions, each with an optional`AS <name>`

to name the column in the resulting table.

`FROM <table>`

Thetableis a comma-separated list of table names or subqueries, each with an optional`AS <name>`

to qualify the table name in references to its columns. When multiple tables are specified separated by commas, their join (cartesian product) is selected from.

`WHERE <condition>`

Theconditionis a BQL expression selecting a subset of the input rows fromtablefor which output rows will be computed.

`GROUP BY <grouping>`

Thegroupingis a BQL expression specifying a key on which to group output rows. May be the name of an output column with`AS <name>`

incolumns.

`ORDER BY *expression* [ASC|DESC]`

Theexpressionis a BQL expression specifying a key by which to order output rows, after grouping if any. Rows are yielded in ascending order of the key by default or if`ASC`

is specified, or in descending order of the key if`DESC`

is specified.

`LIMIT <n> [OFFSET <offset>]`

or`LIMIT <offset>, <n>`

Bothnandoffsetare BQL expressions. Only up ton(inclusive) rows are returned after grouping and ordering, starting atoffsetfrom the beginning.

`ESTIMATE <expression> BY <population>`

Like constant`SELECT`

, extended with model estimators of one implied row.

`ESTIMATE [DISTINCT|ALL] <expression> FROM <population> [MODELED BY <g>] [USING [MODEL <num>] [MODELS <num0>-<num1>]] [WHERE <condition>] [GROUP BY <grouping>] [ORDER BY <ordering>] [LIMIT <limit>]`

Like`SELECT`

on the table associated withpopulation, extended with model estimators of one implied row.

`ESTIMATE <expression> FROM VARIABLES OF <population> [MODELED BY <g>] [USING [MODEL <num>] [MODELS <num0>-<num1>]] [WHERE <condition>] [GROUP BY <grouping>] [ORDER BY <ordering>] [LIMIT <limit>]`

Like`SELECT`

on the modeled columns ofpopulation, extended with model estimators of one implied column.

`ESTIMATE <expression> FROM PAIRWISE VARIABLES OF <population> [FOR <subcolumns>] [MODELED BY <g>] [USING [MODEL <num>] [MODELS <num0>-<num1>]] [WHERE <condition>] [ORDER BY <ordering>] [LIMIT <limit>]`

Like

`SELECT`

on the self-join of the modeled columns ofpopulation, extended with model estimators of two implied columns.In addition to a literal list of column names, the list of

subcolumnsmay be an`ESTIMATE * FROM VARIABLES OF`

subquery.

`ESTIMATE <expression> FROM PAIRWISE <population> [MODELED BY <g>] [USING [MODEL <num>] [MODELS <num0>-<num1>] [WHERE <condition>] [ORDER BY <ordering>] [LIMIT <limit>]`

Like

`SELECT`

on the self-join of the table associated withpopulation, extended with model estimators of two implied rows.(Currently the only

expressionfunctions of two implied rows are`SIMILARITY`

and`SIMILARITY IN THE CONTEXT OF (...)`

.)

`INFER <colnames> [WITH CONFIDENCE <conf>] FROM <population> [MODELED BY <g>] [USING [MODEL <num>] [MODELS <num0>-<num1>]] [WHERE <condition>] [GROUP BY <grouping>] [ORDER BY <ordering>] [LIMIT <limit>]`

Select the specified

colnamesfrompopulation, filling in missing values if they can be filled in with confidence at leastconf, a BQL expression. Only missing valuescolnameswill be filled in; missing values in columns named incondition,grouping, andorderingwill not be. Model estimators and model predictions are allowed in the expressions.The

colnamesis a comma-separated list of column names,notarbitrary BQL expressions.

`INFER EXPLICIT <expression> FROM <population> [MODELED BY <g>] [USING [MODEL <num>] [MODELS <num0>-<num1>]] [WHERE <condition>] [GROUP BY <grouping>] [ORDER BY <ordering>] [LIMIT <limit>]`

Like

`SELECT`

on the table associated withpopulation, extended with model estimators of one implied row and with model predictions.In addition to normal

`SELECT`

columns,expressionmay include:`PREDICT <name> [AS <rename>] CONFIDENCE <confname>`

This results in two resulting columns, one named

rename, ornameifrenameis not supplied, holding a predicted value of the columnname, and one namedconfnameholding the confidence of the prediction.

`SIMULATE <colnames> FROM <population> [MODELED BY <g>] [USING [MODEL <num>] [MODELS <num0>-<num1>]] [GIVEN <constraints>] [LIMIT <limit>]`

Select the requested

colnamesfrom rows sampled frompopulation. Theconstraintsis a comma-separated list of constraints of the form`<colname> = <expression>`

representing equations that the returned rows satisfy.

The number of rows in the result will be

limit.

## BQL Expressions¶

BQL expressions, like SQL expressions, may name columns, include query
parameters, use standard arithmetic operators, and use SQL functions
such as `ABS(<x>)`

, as documented in the SQLite3 Manual.

In addition, BQL expressions in `ESTIMATE`

and `INFER`

queries may
use model estimators, and BQL expressions in `INFER`

queries may use
model predictions.

### Model Estimators¶

Model estimators are functions of a model, up to two columns, and up to one row.

WARNING: Due to limitations in the sqlite3 query engine that bayeslite relies on (Github issue #308), repeated references to a model estimator may be repeatedly evaluated for each row, even if they are being stored in the output of queries. For example,

ESTIMATE MUTUAL INFORMATION AS "mutinf" FROM PAIRWISE VARIABLES OF p ORDER BY "mutinf"

has the effect of estimating mutual information twice for each row because it is mentioned twice, once in the output and once in the ORDER BY, which is twice as slow as it needs to be. (Actually, approximately four times, because mutual information is symmetric, but that is an orthogonal issue.)

To avoid this double evaluation, you can order the results of a subquery instead:

SELECT * FROM ( ESTIMATE MUTUAL INFORMATION AS "mutinf" FROM PAIRWISE VARIABLES OF p ) ORDER BY "mutinf"

`PREDICTIVE PROBABILITY OF <column> [GIVEN (<column(s)>)]`

Function of one implied row. Returns the predictive probability of the row’s value for the column namedcolumn, optionally given the data incolumn(s)in the row.

`PROBABILITY DENSITY OF <column> = <value> [GIVEN (<constraints>)]`

`PROBABILITY DENSITY OF (<targets>) [GIVEN (<constraints>)]`

Constant. Returns the probability density of the value of the BQL expression

valuefor the columncolumn. Iftargetsis specified instead, it is a comma-separated list of`<column> = <value>`

terms, and the result is the joint density for all the specified target column values.If

constraintsis specified, it is also a comma-separated list of`<column> = <value>`

terms, and the result is the conditional joint density given the specified constraint column values.WARNING: The value this function returns is not a normalized probability in [0, 1], but rather a probability density with a normalization constant that is common to the column but may vary between columns. So it may take on values above 1.

`PROBABILITY DENSITY OF VALUE <value> [GIVEN (<constraints>)]`

Function of one implied column. Returns the probability density of the value of the BQL expressionvaluefor the implied column. Ifconstraintsis specified, it is a comma-separated list of`<column> = <value>`

terms, and the result is the conditional density given the specified constraint column values.

`SIMILARITY [OF (<boolexpr0>)] [TO (<boolexpr1>)] IN THE CONTEXT OF <column>`

Constant, or function of one or two implied rows. If given both`OF`

and`TO`

, returns a constant measure of similarity between the first row satisfied byboolexpr0and the first row satisfied byboolexpr1. If given only`TO`

returns a measure of the similarity of the implied row with the first row satisfyingboolexpr1. Otherwise, returns a measure of the similarity of the two implied rows. The similarity may be considered within the context of a column.

`PREDICTIVE RELEVANCE [OF (<boolexpr0>)] TO EXISTING ROWS (<boolexpr1>) IN THE CONTEXT OF <column>`

`PREDICTIVE RELEVANCE [OF (<boolexpr0>)] TO HYPOTHETICAL ROWS (<boolexpr1>) IN THE CONTEXT OF <column>`

`PREDICTIVE RELEVANCE [OF (<boolexpr0>)] TO EXISTING ROWS (<boolexpr1>) AND HYPOTHETICAL ROWS (<boolexpr2>) IN THE CONTEXT OF <column>`

If given`OF`

, returns a measure of predictive relevance of the first row satisfyingboolexpr0for the existing and/or hypothetical rows satisfyingboolexpr1(andboolexpr2in the case of both) in the context ofcolumn. Otherwise, returns a measure of predictive relevance of all rows to the specified existing and/or hypothetical rows.

`CORRELATION [[OF <column1>] WITH <column2>]`

Constant, or function of one or two implied columns. Returns standard measures of correlation between columns:

- Pearson correlation coefficient squared for two numerical columns.
- Cramer’s phi for two nominal columns.
- ANOVA R^2 for a nominal column and a numerical column.
Cyclic columns are not supported.

`DEPENDENCE PROBABILITY [[OF <column1>] WITH <column2>]`

Constant, or function of one or two implied columns. Returns the probability (density) that the two columns are dependent.

`MUTUAL INFORMATION [[OF <column1>] WITH <column2>] [USING <n> SAMPLES]`

Constant, or function of one or two implied columns. Returns the strength of dependence between the two columns, in units of bits.

If

`USING <n> SAMPLES`

is specified and the underlying generator uses Monte Carlo integration for each model to estimate the mutual information (beyond merely the integral averaging all generators), the integration is performed usingnsamples for each model.

### Model Predictions¶

`PREDICT <column> [WITH CONFIDENCE <confidence>]`

Function of one implied row. Samples a value for the column namedcolumnfrom the model given the other values in the row, and returns it if the confidence of the prediction is at least the value of the BQL expressionconfidence; otherwise returns null.