< Prev: Basic VentureScript ^ Top Next: Advanced Curve Fitting: Mixture Models and Gaussian Processes >

More VentureScript: Linear Regression

In this chapter, we will build up a somewhat more elaborate model, and explore strategies for inspecting and debugging it. The problem we address is linear regression: trying to infer a linear relationship between an input and an output from some data. This is of course well-trodden terrain -- we are primarily using it to introduce more VentureScript concepts, but the models themselves will get interesting toward the end.

Contents

Recap

In the previous part we discussed basic concepts of VentureScript programming, modeling, and inference. As a reminder, the main modes of using VentureScript we saw are starting Venture interactively

$ venture

and running standalone VentureScript scripts

$ venture -f script.vnts

Simple Linear Regression

Let's start by putting down the first model for linear regression that comes to mind. A line has a slope and an intercept

assume intercept = normal(0,10);
assume slope = normal(0,10)
10.8420009315

and is reasonable to represent as a function from the independent to the dependent variable

assume line = proc(x) { slope * x + intercept }
[]

In VentureScript, proc is the syntax for making an anonymous function with the given formal parameters and behavior, and we use assume to give it a name in our model.

Finally, we will model our data observations as subject to Gaussian noise, so we capture that by defining an observation function

assume obs = proc(x) { normal(line(x), 1) }
[]

If you don't want to type that, you can download lin-reg-1.vnts and use it with

$ venture -f lin-reg-1.vnts --interactive

Let's give this model a smoke test

// A tiny data set
observe obs(1) = 2;
observe obs(2) = 2;

// Should be solvable exactly
infer conditional;

// Do we get plausible values?
sample list(intercept, slope, line(3))
[1.9058282505054147, -0.12437887733785705, 1.5326916184918435]

Note that we are using rejection sampling here (infer conditional). This is fine for this example, but as explored previously, rejection sampling can become very slow as you increase the difficulty of the problem (such as by adding more data points, as we will do shortly).

Exercise: For now, evaluate the model's performance on the two data points.

  1. Gather and save a dataset of independent posterior samples of the parameters and a prediction at some value. [Hint: You can pass multiple arguments to collect] (show answer)

    Answer:

    define data = run(accumulate_dataset(100,
     do(conditional,
        collect(intercept, slope, line(3)))))
    
    <venture.engine.inference.Dataset object at 0x2b7b107c4e50>
  2. Predictive accuracy: Where do you expect the predictive mean to be, based on the observations we trained the model on? How widely do you expect it be dispersed, based on the noise in the model? Plot the predictions: does the data agree with your intuition? (show answer)

    Answer: 2, pretty broad,

    plot("h2", data)
    
    []

    A Venture plot

  3. Parameter inference

    a. Where do you expect the intercept to be located, and how concentrated do you expect it to be? Plot the intercept and check your intuition. (show answer)

    Answer: 2, broad,

    plot("h0", data)
    
    []

    A Venture plot

    b. Where do you expect the slope to be located, and how concentrated do you expect it to be? Plot the slope and check your intuition. (show answer)

    Answer: 0, broad,

    plot("h1", data)
    
    []

    A Venture plot

    c. What do you expect the relationship between the slope and intercept to be? Plot them against each other and check your intuition. Hint: The usual way to get a scatter plot is p0d1d: p for point and d for direct as opposed to log scale. A gotcha: if you leave the middle d off, the 0 and 1 will run together into a single index. (show answer)

    Answer: The slope and intercept are anticorrelated, because the line is trying to pass through around (1.5, 2).

    plot("p0d1d", data)
    
    []

    A Venture plot

Topic: Working with Datasets

In the previous segment, we started with the simple linear regression model in lin-reg-1.vnts

$ venture -f lin-reg-1.vnts --interactive

In this segment, we will try it on more data.

There is a little synthetic regression dataset in the file reg-test.csv. Download it and have a look.

You could, of course, type all this data into your Venture session as

observe obs(this) = that;
observe obs(that) = the other;
...

but that's pretty tedious. Being a probabilistic programming language, VentureScript doesn't replicate standard programming facilities like reading data files, and instead offers close integration with Python. So the way to do a job like this is to use <code>pyexec</code>:

run(pyexec("import pandas as pd"));
run(pyexec("frame = pd.read_csv('reg-test.csv')"));
run(pyexec("ripl.observe_dataset('obs', frame[['x','y']].values)"))
[]
list_directives
35: intercept: assume intercept = normal(0.0, 10.0):  4.00723528151
37: slope: assume slope = normal(0.0, 10.0):    -0.542008361467
39: line: assume line = make_csp(quote(x()), quote(add(mul(slope, x), intercept))): <procedure>
41: obs: assume obs = make_csp(quote(x()), quote(normal(line(x), 1.0))):    <procedure>
43: observe obs(1.0) = 2.0
45: observe obs(2.0) = 2.0
357: observe obs(quote(-10.0)) = -1.36911818981
358: observe obs(quote(-9.0)) = -1.37284993346
359: observe obs(quote(-8.0)) = -1.36589951787
360: observe obs(quote(-7.0)) = 0.0322288526582
361: observe obs(quote(-6.0)) = 0.153331662166
362: observe obs(quote(-5.0)) = 0.524923866479
363: observe obs(quote(-4.0)) = 1.01300353852
364: observe obs(quote(-3.0)) = 1.15044755258
365: observe obs(quote(-2.0)) = 2.51616625277
366: observe obs(quote(-1.0)) = 2.54757544339
367: observe obs(quote(0.0)) = 2.17524586269
368: observe obs(quote(1.0)) = 2.63727805696
369: observe obs(quote(2.0)) = 3.48041084204
370: observe obs(quote(3.0)) = 3.89578160414
371: observe obs(quote(4.0)) = 4.15055641543
372: observe obs(quote(5.0)) = 3.82503127393
373: observe obs(quote(6.0)) = 5.13278005841
374: observe obs(quote(7.0)) = 4.76360139589
375: observe obs(quote(8.0)) = 5.71061853907
376: observe obs(quote(9.0)) = 6.16338309941
377: observe obs(quote(10.0)) = 6.55698505098

If you want to know what just happened there, pyexec is an action of the inference language that executes the given string of Python code. The binding environment persists across calls, and begins with the variable ripl bound to the current session's Read-Infer-Predict Layer, which serves as the entry point for using VentureScript as a Python library. In this case, we used the escape to load our csv file up into a Pandas DataFrame, and then used the RIPL's <code>observe_dataset</code> method to load it into VentureScript as individual observations.

OK, so we loaded the data. We could do some inference

infer default_markov_chain(30)
[81.5]

and ask for a prediction at a new x-value

sample line(11)
-4.25118253977

but without a look at the data, we don't even know what to expect. There are of course plenty of ways to plot a dataset; in this case, we will use VentureScript's plot:

run(pyexec("import venture.lite.value as vv"));
define frame = run(pyeval("vv.VentureForeignBlob(frame[['x', 'y']])"));
plot("p0d1d", frame)
[]

A Venture plot

The dataset exhibits a clear linear relationship, and we should expect line(11) to be around 7.

Let's try using Markov chain inference to fit our linear model. Start 100 independent chains, evolve them for 30 iterations, and record where they end up:

define data = run(accumulate_dataset(100,
  do(reset_to_prior,
     default_markov_chain(30),
     collect(intercept, slope, line(11)))))
<venture.engine.inference.Dataset object at 0x2b7b18895d90>

Now we can plot a histogram of the predicted line(11) value.

plot("h2", data)
[]

A Venture plot

Topic: Understanding a model with visualization

Is our fit any good? Although we can get some idea of what's going on by plotting the predictions for a given x, it's much more effective to be able to see the fitted line plotted alongside the data, and see what happens over the course of inference.

In order to do so, we can make use of Venture's plugin infrastructure to load external code written in Python. In this case, we've provided a custom plugin that produces an animated curve-fitting visualization that can be updated during the inference program. Download curve-fitting.py and load it into VentureScript using

run(load_plugin("curve-fitting.py"))
[]

Alternatively, the plugin can be loaded when you launch Venture by supplying -L curve-fitting.py on the command line.

When the plugin loads, a new window should pop up where the animation will appear. The plugin provides a callback that can be invoked to draw a new frame

run(call_back(draw))
[]

which should render a plot that looks like this:

A frame of the visualization

The blue line is the current guess. The green points are the observations; the error bars around them reach twice the standard deviation of the noise, above and below. The number at the top left is the count of how many frames have been drawn.

We can also plot multiple frames with inference commands interleaved

infer reset_to_prior;
infer repeat(30, do(default_markov_chain(1),
                    call_back(draw)))
[]

The <code>repeat</code> command works similarly to accumulate_dataset, but in this case we aren't concerned with collecting the data since it's being rendered in the visualization instead. Together, this command will run the Markov chain for 30 iterations, drawing an animation frame after each iteration.

Another useful command which works well with visualization is <code>loop</code>, which runs the given action continually in the background. You can use it like this

infer loop(do(default_markov_chain(5),
              call_back(draw)))
None

which will iterate the Markov chain and draw an animation frame every 5 iterations, continuing to do so indefinitely. You can turn it off with

endloop
<procedure>

Exercise: Play with this visualization to see how inference is doing.

  1. Use the loop command to run the Markov chain continually and draw an animation frame after every iteration. Watch the animation while it runs and confirm that the line eventually approaches a good fit to the data. Does it take gradual steps or sudden jumps? (show answer)

    Answer: It takes sudden jumps, with varying amount of time between jumps.

    infer loop(do(default_markov_chain(1), call_back(draw)))
    
    None
  2. How many iterations does it take for the chain to converge? Modify the inference loop to call reset_to_prior after every 30 iterations of the Markov chain. (Hint: You may want to use repeat and do.) Is the Markov chain able to find a good fit within 30 iterations? What about 100 iterations? Try playing with the number of iterations to see how many it takes to reliably find a reasonable fit. (show answer)

    Answer: 30 is not enough, and it may not even be obvious visually when it's resetting to the prior. 100 is better, but still often fails to find a fit; 200-300 is clearly able to find it most of the time.

    infer loop(do(reset_to_prior,
                 repeat(30, do(default_markov_chain(1),
                               call_back(draw)))))
    
    None
  3. (If you're still running the infer loop, turn it off for this part.)

    endloop
    
    <procedure>

    We can use resample to run multiple independent chains and plot all of the resulting lines in the same window, to get some sense of the posterior distribution over lines. Resample 20 particles (remember to reset them to the prior), run them for 30 iterations, and look at the resulting animation. Does it look like the distribution of lines converges to something plausible? If you're impatient, try increasing the number of iterations as above. (show answer)

    Answer: Some of the samples do concentrate around the correct line, but again 30 iterations is not enough to converge.

    infer resample(20);
    infer do(reset_to_prior,
            repeat(30, do(default_markov_chain(1),
                          call_back(draw))));
    infer resample(1);
    
    []
  4. Inference alternatives

    a. In the optional subunit at the end of the previous chapter, we saw how to construct our own Markov chains that used custom transition operators, with a drift kernel example. Experiment with using a drift kernel for this problem. How quickly can you get it to converge by choosing a good proposal distribution? (show answer)

    Here is an example program using a Cauchy proposal kernel (you may have used something different):

    ``` define driftkernel = proc(x) { studentt(1, x, 1) };

    define driftmarkovchain = mhcorrect( onsubproblem(default, all, symmetriclocalproposal(drift_kernel)));

    infer do(resettoprior, repeat(30, do(driftmarkovchain, call_back(draw)))) ```

    []

    The Cauchy distribution has heavy tails, which means that this will usually take small steps but occasionally propose large steps. This seems to work fairly well here.

    b. As an alternative to Markov chains, we can use gradient ascent to find the best fit. Venture provides an inference primitive, <code>grad_ascent</code> which deterministically climbs the gradient toward the most probable parameter values. It can be invoked as follows:

    infer repeat(30, do(grad_ascent(default, all, 0.001, 1, 1), call_back(draw)))

    []

    This has a very different character than the inference primitives we've seen so far. Experiment with it, tweaking the step size (the "0.001"; be careful setting it too high!) and using reset_to_prior to start from a few different initial conditions. Can you think of some advantages and disadvantages of gradient methods compared to Monte Carlo methods? (show answer)

    Answer: Gradient ascent can move toward high-probability regions of the parameter space faster and more directly than Monte Carlo methods. However, the convergence of gradient ascent highly depends on the shape of the posterior probability density function; if the posterior density is too flat, it will move very slowly, whereas if the posterior density is too steep relative to the step size, it may overshoot and spiral off to infinity. Additionally, if there are local maxima, gradient ascent will not be able to escape them. (Markov chains are also susceptible to local maxima, although less so.)

    Convergence aside, an important conceptual difference is that gradient ascent gives only a point estimate of the single most probable values. In contrast, sampling methods return values that are (approximately) distributed according to the full posterior distribution, allowing the uncertainty in those values to be quantified and taken into account when making decisions.

Elaboration: Outlier Detection

So far, we haven't really done anything beyond what standard tools for linear least squares regression can do. Now we will look at what happens if the data may contain outliers: points that do not fit in with the rest. In ordinary least squares regression, outliers can disproportionally influence the result and so are often excluded before running the regression. Here, we will take a different approach by extending our probabilistic model to take outliers into account.

clear

We start with our linear model much the same as before

run(load_plugin("curve-fitting.py"));
assume intercept = normal(0,7);
assume slope = normal(0,4);

assume line = proc(x) { slope * x + intercept }
[]

but this time, each point has some probability of being an outlier.

assume outlier_prob = uniform_continuous(0.01,0.3);
assume is_outlier = mem(proc(i) { flip(outlier_prob) })
[]

(Here we use <code>mem</code> to create an indexed set of random variables is_outlier(i) for each i, which indicate whether the i-th point is an outlier.)

What does it mean for a point to be an outlier? Unlike most of the other points, which we can expect to be near the value predicted by our linear equation, an outlier could just as likely be anywhere. We will model that by saying that for an outlier point, the y-value is drawn from normal(0,20).

assume obs_outlier = proc(x) { normal(0,20) }
[]

Non-outlier points, as before, are centered on the line but subject to some Gaussian noise. This time, rather than assuming a fixed amount of noise, we will put a prior over the amount of noise as well and see how well we can infer it:

assume noise = gamma(1,1);
assume obs_inlier = proc(x) { normal(line(x), noise) }
[]

Putting that all together, our full data model is

assume obs = proc(i,x) {
  if (is_outlier(i)) {
    obs_outlier(x)
  } else {
    obs_inlier(x)
  }
}
[]

Whew. If you don't want to type all that in, it's also in lin-reg-2.vnts.

Let's try this with some data

observe obs(0,1) = 2.5;
observe obs(1,4) = 1;
observe obs(2,-3) = -1;
observe obs(3,-5) = 6
[-14481.585663694754]

and load it into our visualization

run(load_plugin("curve-fitting.py"));
infer loop(do(default_markov_chain(1),
              call_back(draw)))
None

This should spend a while considering different hypotheses and whether or not to treat each point as an outlier. Eventually, it might find this configuration and settle on it for a while:

Most favorable configuration of the outlier model

The grey dot is currently assumed to be an outlier, whereas the green ones are inliers.

endloop
<procedure>

If you are used to least squares regression, the above model may look a bit weird. A search process that was trying to minimize, say, squared error on inliers would be very happy to just call every data point an outlier and be done with it. Why doesn't the Markov chain decide to do that?

This is an instance of the phenomenon known as Bayesian Occam's Razor. On this example, it works as follows: Since we are being fully Bayesian here, we have to give a probability model for the outliers as well as the inliers. Since it's a probability model, its total mass per data point must add up to 1, just like the inlier model's total mass per data point. Since the point of the outlier model is to assign reasonable probability to a broad range of points (outliers!), it perforce cannot assign very high mass to any of them -- necessarily lower than the mass the inlier model assigns to points sufficiently near the line. Consequently, several nearly colinear points are much more likely to be inliers on a line than to all be outliers; and the Markov chain, being a correct algorithm, finds this. Philosophically, one might call the outlier model "complex", in the sense of being high entropy; and Bayesian Occam's Razor automatically causes the posterior to prefer lower entropy -- "simpler" -- explanations.

Topic: Using tags for inference control

Now that we have a more complex model with more moving parts, we'd like to be able to treat each of those parts independently when it comes to deciding how to do inference. VentureScript provides a mechanism to tag different parts of the model program and then refer to those tags from inference commands.

Let's see how this works. We'll load the same model as before, but this time we'll tag each of the random choices made in the model so that we can refer to them separately (lin-reg-3.vnts):

clear
run(load_plugin("curve-fitting.py"));
assume intercept = tag("param", "intercept", normal(0,7));
assume slope = tag("param", "slope", normal(0,4));

assume line = proc(x) { slope * x + intercept };

assume noise = tag("hyper", "noise", gamma(1, 1));

assume outlier_prob = tag("hyper", "outlier_prob", uniform_continuous(0.01, 0.3));

assume is_outlier = mem(proc(i) { tag("outlier", i, flip(outlier_prob)) });

assume obs = proc(i, x) {
  if (is_outlier(i)) {
    normal(0,20)
  } else {
    normal(line(x), noise)
  }
};
[]

Each tag command takes three arguments: a scope identifier, a block identifier, and the expression it applies to. Scope and block identifiers can be any unique expression. Inference commands can then use these identifiers to target a particular scope and block. For example, we can do Markov chain inference on just the line parameters using

infer mh("param", one, 1);
[2.0]

The mh inference command is the same as default_markov_chain, except that it allows you to specify a scope and block to operate on; in fact, default_markov_chain(x) is an alias for mh(default, one, x). Here default is a keyword that refers to the default scope containing all random choices in the model, and one means to pick one randomly from the blocks associated with the scope.

What are scoped inference commands good for? Here are some use cases:

Let's apply these ideas to our model. Load the observations again

observe obs(0,1) = 2.5;
observe obs(1,4) = 1;
observe obs(2,-3) = -1;
observe obs(3,-5) = 6
[-6171.75383525053]

and run the following inference program:

define drift_kernel = proc(x) { student_t(1, x, 1) };

define drift_mh =
  proc(scope, block, transitions) {
    repeat(transitions,
      mh_correct(
       on_subproblem(scope, block,
         symmetric_local_proposal(drift_kernel))))
  };

infer loop(do(drift_mh("param", all, 2),
              mh("hyper", one, 1),
              gibbs("outlier", one, 1),
              call_back(draw)))
None

This will alternate between the different tagged parts of the model, using a different inference strategy for each one.

Exercise: Compare this to the infer loop that only used default_markov_chain from earlier. What qualitative differences in behavior do you see? Try adjusting the number of iterations allocated to each scope and see what happens. (show answer)

Answer: The line moves more often, because the program spends less time on uselessly reconsidering whether points are outliers and more time on proposing new line parameters.

Giving more iterations to the line parameters (say, mh("param", one, 5)) will cause the line to move even more, but it may miss the opportunity to reduce the noise level or flag a point as an outlier when it doesn't fit.

Giving more iterations to the noise hyperparameter or the outlier detection will make these opportunities more likely, but may also waste time reconsidering the same random choices over and over just like default_markov_chain does.

Exercise: What happens if you replace drift_mh with grad_ascent? Hint: Make sure to pass the appropriate scope and block to grad_ascent. (show answer)

The line now moves smoothly toward the points that are marked as non-outliers. On the flip side, it is no longer able to explore alternative hypothesis once it has settled on a (possibly suboptimal) fit.

infer loop(do(grad_ascent("param", all, 0.001, 10, 1),
              mh("hyper", one, 1),
              gibbs("outlier", one, 1),
              call_back(draw)))
None
endloop
<procedure>

< Prev: Basic VentureScript ^ Top Next: Advanced Curve Fitting: Mixture Models and Gaussian Processes >