< Prev: More VentureScript: Linear Regression ^ Top

Advanced Curve Fitting: Mixture Models and Gaussian Processes

In this chapter, we will take a dive into some more advanced modeling idioms. We will move faster than we did in previous chapters. You can skim this for a flavor of the sorts of things that are possible in VentureScript; and/or study the programs and refer to the manual for a deeper understanding of what's going on.

We will also see more examples of extending Venture by writing and loading plugins. Plugins and foreign procedures generally are very much an intentional part of using VentureScript, both to take advantage of existing deterministic libraries and to interoperate with custom or legacy probabilistic codes.

Contents

Getting Started

As a reminder, you can start an interactive Venture session with

$ venture

You can also run a file such as script.vnts using

$ venture -f script.vnts

A Simple Mixture Model

One simple way to define a mixture model is to choose a mixture component at top-level, and define each component in a separate code path. (This code block is downloadable as gp-1.vnts).

assume mix_components = array(
    proc() { uniform_continuous(0, 10) },  // component 0
    proc() { uniform_continuous(0, 5) },   // component 1
    proc() { uniform_continuous(0, 2) }    // component 2
    );

assume mix_weights = simplex(0.3, 0.3, 0.4);

assume comp_index = tag(quote(comp_index), 0, categorical(mix_weights, list(0,1,2)));
assume mix_f = lookup(mix_components, comp_index);

sample mix_f()
0.334765969036

Here the value of mix_f() is "doubly random" -- the procedure mix_f is randomly chosen, and after mix_f is fixed, the output of a call to mix_f is also random. Equivalently, the value of mix_f() follows a mixture model with mixture components mix_components and corresponding weights mix_weights.

It can be instructive to examine the effect of observations. The prior on mixture components is:

plot("h0",
 run(accumulate_dataset(100,
  do(reset_to_prior,
     collect(comp_index)))))
[]

A Venture plot

Now let's make some observations

observe mix_f() = 3.3;
observe mix_f() = 3.1;
observe mix_f() = 4.5;
[-inf]

and examine the posterior

plot("h0",
 run(accumulate_dataset(100,
  do(reset_to_prior,
     mh(quote(comp_index), one, 20),
     collect(comp_index)))))
[]

A Venture plot

As we would expect, the possibility of uniform_continuous(0, 2) (component 2) has been eliminated, as the sampled values of mix_f() are not possible in that component. The remaining components are uniform_continuous(0, 10) and uniform_continuous(0, 5).

Notice that uniform_continuous(0, 10) (component 0) is much less probable than uniform_continuous(0, 5) (component 1) in the posterior, whereas it was equally probable in the prior. This is a case of the Bayesian Occam's razor: among models that agree with the data, Bayesian inference automatically favors more parsimonious ones, because they are better able to concentrate predictive mass. Said differently, a specific prediction is more impressive to get confirmed.

In the present case this shows up thus: if mix_f were uniform_continuous(0, 10), wouldn't it be a suspicious coincidence that three independent samples of mix_f() were all between 3 and 5? The coincidence would be much (in fact, exactly 2^(#data) times) less suspicious if mix_f were uniform_continuous(0, 5).

Regression With a Mixture

Taking the above one step further, we can use a mixture model to perform regression. Feel free to clear your session

clear

We are going to use a small plugin to make a custom plot. Download it and load it into VentureScript with

run(load_plugin("multiline_plot_plugin.py"));
[]

The following code (gp-2.vnts) asks, "what is the posterior on models that are either linear or sine wave, conditioned on the given data?"

// Coefficient priors
assume a = uniform_continuous(-2, 2);
assume b = uniform_continuous(-2, 2);
assume c = uniform_continuous(-2, 2);
assume d = uniform_continuous(0, 3);
assume e = uniform_continuous(0, 2*3.14159);

// Mixture
assume components = array(
    proc(x) { a*x + b },
    proc(x) { c*sin(d*x + e) }
    );
assume weights = simplex(0.5, 0.5);
assume true_f = categorical(weights, components);

// Data (inline, in this case)
define data_xs = array(8.73, 7.47, 0.11, 0.26, 3.94, 4.86, 4.07, 5.65, 9.1, 6.62, 9.98, 3.54, 8.12, 5.17, 5.39);
define data_ys = array(0.53, 0.67, 0.23, 0.02, 0.13, 0.27, 0.33, 0.04, 0.68, 0.17, 0.1, 0.16, 0.53, 0.42, 0.4);
define data = zip(data_xs, data_ys);

// Function for programmatically observing the data
define obs = proc(xy) {
    x = lookup(xy, 0);
    y = lookup(xy, 1);
    observe normal(true_f($x), 1.0) = y
};
run(for_each(data, obs));

// Collect several snapshots of the progress of inference
define xs = linspace(0, 10, 200);
define yss = array_from_dataset(run(accumulate_dataset(10,
    do(mh(default, one, 50),
       collect(mapv(true_f, unquote(xs)))))));

// Plot them
plot_lines(xs, yss, data_xs, data_ys, -2, 2, 0.5, 1.0, 1.0);
[]

Plots of f as inference progresses

The plot contains samples of the function true_f as inference progresses, 50 steps at time. As time progresses, the hue changes from cyan to blue to purple to red. Notice that neither a line nor a sine wave is a perfect fit for this data set; lines incur a penalty due to the waviness of the data and sine waves incur a penalty due to the rising pattern of the data.

(The function plot_lines(xs, yss, data_xs, data_ys, ymin, ymax, huemin, huemax, linewidth) works as follows: scatterplot of data_xs versus data_ys, as well as line plots of ys versus xs for each ys in yss. As the line plots progress from the first entry of yss to the last entry of yss, the hue of the drawn line progresses from huemin to huemax (with hues being on a scale from 0 to 1). The lines have width linewidth, with 1 being the normal width. The plot should show up in a matplotlib window.)

Progressive Mixture Regression

Another interesting case of mixture regression is when the mixture components are models of increasing power, where more powerful models are harder to tune/fit to the data. In that case mixture regression can be viewed as a way of training all the components in parallel, and a sample of the computed (approximate) posterior consists of a choice of model along with values of the parameters for that model. We now present such an example, with observation data as shown below:

Observation data

We have drawn in curves for each of the three mixture components we will be using in our mixture model: y = 0.9 (blue), y = 1/x (green), and y = x^(-r) (magenta, with parameter value r = 0.6). Now let's clear our session

clear

and look at a probabilistic program for regression using this mixture (gp-3.vnts):

// Parameter for component 2
assume r = tag(quote(r), 0, beta(2,2));

// Mixture
assume pcomponents = array(
    proc(x) { 0.9 },           // component 0
    proc(x) { pow(x, -1) },    // component 1
    proc(x) { pow(x, 0 - r) }  // component 2
    );
assume pweights = simplex(0.3, 0.3, 0.4);
assume pnum = tag(quote(pnum), 0, categorical(pweights, list(0,1,2)));
assume pf = lookup(pcomponents, pnum);

// Data
define xs = array(0.67, 3.58, 2.52, 1.09, 0.7, 0.4);
define ys = array(1.19, 0.47, 0.67, 0.9, 1.09, 1.67);
define data = zip(xs, ys);

// Observe data points
define obs_pf = proc(xy) {
    x = lookup(xy, 0);
    y = lookup(xy, 1);
    observe normal(pf($x), 1.0) = y
};
run(for_each(data, obs_pf));

// Do inference, and plot the inferred values of r and pnum over time
define dataset = run(accumulate_dataset(100,
    do(if(eq(run(sample(pnum)), atom<2>)) { mh(quote(r), one, 10) } else { pass },
       mh(quote(pnum), one, 10),
       collect(pnum, r))));
plot("p0d1dcd", dataset);
[]

A Venture plot

Above, the parameter r is plotted against pnum (where pnum = 0 for y = 0.3, pnum = 1 for y = 1/x, and pnum = 2 for y = x^(-r)). The darker points correspond to later in time, i.e. after more inference steps.

Note that the inference loop contains the instruction

if(eq(run(sample(pnum)), atom<2>)) { mh(quote(r), one, 10) } else { pass }

This says, if the current value of pnum is 2 (so that y = x^(-r)), then do inference on r, and otherwise, do not touch r. This is a sensible thing to do since the likelihood is indifferent to r if pnum != 2, so any inference steps taken on r when pnum != 2 would retrogress on inference progress. Thus we see a case in which fully programmable inference is beneficial.

Subtle: It is tempting to think of inference on this mixture model as a search: first to find the "right" component, then to find the "right" parameters for that component. That cartoon often doesn't correctly capture the Bayesian picture of inference. The posterior probability of a component is the average goodness (in terms of likeilhood) of using that component with parameters drawn randomly from the prior. So the question of whether one particular max a-posteriori (MAP) solution will dominate the posterior depends not just on (1) how likely the MAP solution is, but also on (2) the prior on parameters, and (3) how wide a region of very-good solutions surround the MAP, which is related to how sensitive the likelihood is to small changes in the parameters. In the above example, r is the only parameter.

Exercise:

  1. Compute a coarse approximation to the posterior on mixture components. Does this model make a clear choice on this data set? (show answer)

    Answer:

    infer resample(10);
    plot("h0", run(accumulate_dataset(10,
     do(conditional,
        collect(pnum)))))
    
    []

    A Venture plot

  2. Come up with a prior on r such that component 2 (x^(-r)) dominates the posterior. (show answer)

    Answer: Setting r to be deterministically the "right" value will favor that component. Of course, since the "right" value is in this case inferred from the data (by eyeballing), this risks overfitting.

    observe r = 0.6;
    infer resample(10);
    plot("h0", run(accumulate_dataset(10,
     do(conditional,
        collect(pnum)))))
    
    []

    A Venture plot

  3. Come up with a prior on r such that component 2 is very improbable in the posterior (but don't cheat -- the region from r = 0.4 to r = 0.8 should still have positive probability). (show answer)

    Answer: Putting a very broad prior on r lets us produce an arbitrarily severe Bayes Occam's Razor penalty against it.

    clear
    
    assume r = uniform_continuous(-100, 100);
    
    
    // Mixture
    assume pcomponents = array(
       proc(x) { 0.9 },           // component 0
       proc(x) { pow(x, -1) },    // component 1
       proc(x) { pow(x, 0 - r) }  // component 2
       );
    assume pweights = simplex(0.3, 0.3, 0.4);
    assume pnum = tag(quote(pnum), 0, categorical(pweights, list(0,1,2)));
    assume pf = lookup(pcomponents, pnum);
    
    // Data
    define xs = array(0.67, 3.58, 2.52, 1.09, 0.7, 0.4);
    define ys = array(1.19, 0.47, 0.67, 0.9, 1.09, 1.67);
    define data = zip(xs, ys);
    
    // Observe data points
    define obs_pf = proc(xy) {
       x = lookup(xy, 0);
       y = lookup(xy, 1);
       observe normal(pf($x), 1.0) = y
    };
    run(for_each(data, obs_pf));
    
    
    
    infer resample(10);
    plot("h0", run(accumulate_dataset(10,
     do(conditional,
        collect(pnum)))))
    
    []

    A Venture plot

  4. (Open ended) What other parts or aspects of the model have a describable effect on the behavior of inference?

Gaussian Process Regression

While the above examples rergess nonlinearly on a data set, the mixture components are still hard-coded parametric models, and as such, the amount of modeller/programmer resources required to write them grows with the complexity of the model.

We will now look at Gaussian process models, which "let the data speak for itself" in the sense that the "effective dimensionality" of the model grows with the number of data points. For an introduction to Gaussian processes viewed as random functions, see Section 2.2 of Rasmussen and Williams.

clear

This segment requires a more elaborate plugin, where we implement a Gaussian process as a foreign stochastic procedure. Load it with

run(load_plugin("multiline_plot_plugin.py"));
run(load_plugin("gpexample_plugin.py"));
[]

and set up the Gaussian process like so

// Covariance function: squared exponential with length-scale 0.8,
// scaled by an squared output scale of 1.2
assume se = gp_cov_scale(1.2, gp_cov_se(0.8*0.8));

// Mean function
assume zero_func = gp_mean_const(0.0);

// Gaussian process
assume g = make_gp(zero_func, se);
[]

The Gaussian process g can be thought of as a random function whose prior is a Gaussian process with mean zero and covariance function se, that is, squared exponential:

prior_covariance(g(x1), g(x2)) = 1.2 * exp((x1-x2)^2 / (2 * 0.8^2))

We can observe the value of g at given data points:

// Data
assume data_xs = array(3.2, 2.2, 8.7);
define data_ys = array(8.7, 0.2, 5.5);
observe g(data_xs) = data_ys;

// Plot several samples of the gp
define xs = linspace(0, 10, 200);
define yss = array_from_dataset(run(accumulate_dataset(10,
    collect(g(unquote(xs))))));
plot_lines(xs, yss, run(sample(data_xs)), data_ys, -5, 12, 1.0, 1.0, 0.5);
[]

Posterior of toy GP

(The above two VentureScript blocks are gp-4.vnts.)

The posterior of g is again a Gaussian process (several samples of which are plotted above), with a different mean function and a different covariance function (see R&W for more on this). The inputs and outputs of g are arrays, because g samples from the joint distribution of the GP's values on all supplied inputs.

Exercise: Consider the following three ways to query a GP.

// Way 1
sample g(array(1.0, 2.0));

// Way 2
sample g(1.0);
sample g(2.0);

// Way 3
sample array(g(1.0), g(2.0));

[-2.180193189426972, -1.4073034685195343]
  1. Ascertain the distributions on values they produce. Are they the same or different? (show answer)

    Answer: Here's the distribution given by Way 2:

    plot("p0d1d", run(accumulate_dataset(500, collect(g(1.0), g(2.0)))))
    
    []

    A Venture plot

    Plotting Ways 1 and 3 is a little trickier, because collect doesn't (yet) have native support for vector-valued expressions (the use of labelled is needed to align the data):

    plot("p0d1d", run(accumulate_dataset(500,
     do(vals <- sample(array(g(1.0), g(2.0))),
        collect(labelled(lookup(quote(unquote(vals)), 0), g_1),
                labelled(lookup(quote(unquote(vals)), 1), g_2))))))
    
    []

    A Venture plot

    plot("p0d1d", run(accumulate_dataset(500,
     do(vals <- sample(g(array(1.0, 2.0))),
        collect(labelled(lookup(quote(unquote(vals)), 0), g_1),
                labelled(lookup(quote(unquote(vals)), 1), g_2))))))
    
    []

    A Venture plot

    Ways 1 and 3 produce the same multivariate Gaussian, whereas Way 2 produces independent samples.

  2. Why does that happen? (show answer)

    Answer: Let's take them one at a time.

    Way 1 is the "right" way to query a Gaussian process: supply all the test inputs at once. By definition, the GP returns a sample from a multivariate Gaussian determined by the mean and covariance functions applied to those points and previously queried ones.

    Way 2 exhibits an important property of VentureScript: a sample expression does not modify the model. So, the GP cannot "remember" the value sampled at x=1.0 when sampling at x=2.0, and produces independent samples.

    Way 3 is "right" again. Expressions evaluated in the same sample command can influence each other. Way 3 samples a value from the GP at x=1.0 conditioned on the observations, and then one at x=2.0 conditioned on the observations and the value at x=1.0.

Note that the choice of covariance function can itself be random (as long as all possible functions are symmetric and positive-semidefinite). For example, rather than defining se = make_squaredexp(1.2, 0.8), we could have defined se = make_squaredexp(sigma_prior(), l_prior()), where sigma_prior and l_prior are stochastic procedures. The inference engine could then be used to perform parameter estimation; in fact, we do this below, in the section on Bayesian optimization. It would even be possible to infer the structure of the covariance function, if we had in mind a generative model of covariance function structures.

Gaussian Process Memoization

In this segment, we will play with a new idiom for GP modeling that the Venture team recently developed language support for, called Gaussian process memoization, or gpmem. The idea is to wrap a function f into a package containing both a self-caching "prober" (which actually calls f) and a statistical emulator, which simulates f using a GP prior conditioned on the values of all previous invocations of the prober. The arguments to gpmem are the function f, a prior mean function for the GP-based emulator, and a prior covariance function. For example (gp-5.vnts):

clear
run(load_plugin("gpexample_plugin.py"));

// Function to model
assume f = make_audited_expensive_function("f");

// Mean and covariance functions for the GP emulator
assume zero_func = gp_mean_const(0.0);
assume se = gp_cov_scale(1.2, gp_cov_se(0.8*0.8));

// GPMem package
assume package = gpmem(f, zero_func, se);
assume f_probe = first(package);
assume f_emu = second(package);

// Test
assume ffive = f_probe(5);
assume fsix = f_probe(6);

sample f_emu(array(5.5))
[PROBE f] Probe #1: f(5.000000) = -0.391519
[PROBE f] Probe #2: f(6.000000) = -0.641769
[-1.32967053]

Note that the emulator expects an array as input (since it is a GP, as above), but the prober does not (since f is just a function, there is no need to "jointly" evaluate f at multiple points).

Our original motivation for this idiom was the case when f is too expensive to compute many times, so we need to learn using a limited number of samples. Later we realized that GP regression in general can be cast as a special case of gpmem: regression on a dataset D (with keys x and values y = D[x]) is the same as statistical emulation of the function f defined in pseudocode as

def f(x):
  if x in D:
    return D[x]
  else:
    raise Exception("Illegal input")

Observing data points {(x, y)} is the same as probing the value of f at all the xs.

If you can think of any other potential applications of gpmem, we would love to hear them.

Bayesian Optimization Example

The example below showcases several nice features of gpmem and GP modeling in VentureScript. In the code below we aim to find the maximum value of our expensive function V on the interval [-20, 20] using as few evaluations of V as possible. We do this by wrapping V in a gpmem and repeatedly using the GP-based emulator to find a point where we have the most evidence than V will be large. We probe V at said point, which gives our GP model more information to choose the next probe point. In addition, we perform Metropolis-Hastings inference on the hyperparameters of the covariance function after each probe (gp-6.vnts).

clear
run(load_plugin("gpexample_plugin.py"));

// The target
assume V = make_audited_expensive_function("V");

// The GP memoizer
assume zero_func = gp_mean_const(0.0);
assume V_sf = tag(quote(hyper), 0, uniform_continuous(0, 10));
assume V_l = tag(quote(hyper), 1, uniform_continuous(0, 10));
assume V_l2 = V_l*V_l;
assume V_se = gp_cov_scale(V_sf, gp_cov_se(V_l2));
assume V_package = gpmem(V, zero_func, V_se);
assume V_probe = first(V_package);
assume V_emu = second(V_package);

// A very naive estimate of the argmax of the given function
define mc_argmax = proc(func) {
  candidate_xs = mapv(proc(i) {uniform_continuous(-20, 20)},
                      arange(20));
  candidate_ys = mapv(func, candidate_xs);
  lookup(candidate_xs, argmax_of_array(candidate_ys))
};

// Shortcut to sample the emulator at a single point without packing
// and unpacking arrays
define V_emu_pointwise = proc(x) {
    run(sample lookup(V_emu(array(unquote(x))), 0))
};

// Main inference loop
infer repeat(15, do(pass,
    // Probe V at the point mc_argmax(V_emu_pointwise)
    predict(V_probe(unquote(mc_argmax(V_emu_pointwise)))),
    // Infer hyperparameters
    mh(quote(hyper), one, 50)));

Here is a visualization of a run of this program (not the same run as above), showing the state of the emulator at various times. Each row adds a new probe point. The left column is before hyperparameter inference, the right column after. The blue curve is the true function V, the green circle is the next chosen probe point, and the purple square is the best probe point found so far. Notice that probes tend to be chosen either where the value of the emulator is high, or where the emulator has high uncertainty. Also note that the "crazy" plots on the third-fourth rows correspond to a "crazy" choice of parameters. This is not a mistake -- Metropolis-Hastings samplers will make crazy choices every once in a while (as will the true posterior, with some small probability).

Visualization of Bayesian optimization

< Prev: More VentureScript: Linear Regression ^ Top