Basic VentureScript

In this chaper, we will go through the basic, foundational concepts in VentureScript: sampling, models, and automatic inference. The practice model will intentionally be very simple, so that it's possible to understand very clearly what's happening.

Basic VentureScript
Getting Started
First Steps
Basic Modeling
Better Inspection
Inference Choices

Getting Started

Start a Venture session by running

$ venture

venture[script] >

This command starts an interactive interpreter for VentureScript. In general, this tutorial will show interactions in grey blocks as above, with the user input in roman font, and the system's output in italics.

You can also run VentureScript scripts by making a file with a .vnts extension and running

$ venture -f script.vnts

N.B.: The command line reference explains what's happening here and what other options venture accepts.

First Steps

Sample an expression

uniform_continuous(0, 1)

0.903757171215

These are independent samples from the unit uniform distribution

uniform_continuous(0, 1)

0.0383357226451

We can introduce a variable:

define mu = expon(2)  // Exponential distribution with rate 2

0.027749186781

whose value is a particular sample from that distribution:

mu

0.027749186781

and we can sample further expressions conditioned on that value:

normal(mu, 0.2)

-0.0440719903108

Exercise: Convince yourself that, for any given mu, normal(mu, 0.2) is a different distribution from

normal(expon(2), 0.2)

-0.0718994784776

because the latter samples a new internal mu every time. Drawing just a few samples from each should suffice.

You can find the list of distributions (and deterministic functions) available in VentureScript in the appropriate section of the reference manual.

Now clear your session (or quit and restart Venture) for the next segment:

clear

Basic Modeling

When we used define to make a mu variable in the previous segment, that variable's value was fixed to a particular sample from its generating distribution for the remainder of the program.

In order to make variables that we can learn things about by observation and reasoning, we need to flag them as uncertain, using assume:

assume x = normal(0, 1)

0.361778084149

Now x is a normally distributed random variable, and its current value is a sample from that distribution.

Assumed and defined variables live in separate namespaces in VentureScript: define goes into the toplevel namespace, and assume goes in the model namespace. Bare expressions are evaluated in the toplevel namespace, so just typing x gives an error:

*** evaluation: Cannot find symbol 'x'
(autorun x)
         ^

To access the current value of an assumed variable, we use sample

sample x

0.361778084149

because sample runs its expression in the model namespace (without altering the content of the model).

Note that repeated access to x gives the same value each time:

sample x + x

0.723556168298

That's what allows us to form models with conditional dependencies, such as this extension:

assume y = normal(x, 1)

-0.489599284644

So, why is assume any more useful than define? Models, unlike the top level, allow us to register observations:

observe y = 2

sample y

2.0

Note: the value of x doesn't change yet --

sample x

0.361778084149

-- all we did was register an observation.

Now infer.

infer posterior

[]

Now x is a sample from the posterior distribution p(x|y=2) defined by our assumptions and observations:

sample x

-0.263557324474

If we do no more inference, x doesn't change.

sample x

-0.263557324474

But we can compute a new posterior sample by running the posterior inferrer again.

infer posterior;
sample x

-0.0061663297103

Better Inspection

As a reminder, we were studying this model (you can skip this block if you have your session from the last segment):

clear;
assume x = normal(0, 1);
assume y = normal(x, 1);
observe y = 2

We can get samples from inference

infer posterior;
sample x

0.678012003634

but are they actually from the right distribution? We could eyeball it like this a bit more:

infer posterior;
sample x

1.76165725869

infer posterior;
sample x

0.80397067822

but that will get old quickly.

It's better to gather data programmatically, for example like this:

define data = run(accumulate_dataset(10, do(posterior, collect(x))))

<venture.engine.inference.Dataset object at 0x2b9d8e6d13d0>

If you really want to know what's going on in this expression you can study the documentation of do, run, accumulate_dataset, and collect. The effect is to gather up the values of x (and some metadata) over 10 iterations of the posterior inference command.

print(data)

   iter  prt. id  time (s)  log score  prt. log wgt.  prt. prob.         x
0     1        0  0.043709  -2.837982      -8.100795           1  0.989742
1     2        0  0.047718  -2.976821      -8.100795           1  1.372752
2     3        0  0.049731  -3.848979      -8.100795           1  2.005536
3     4        0  0.051427  -2.896388      -8.100795           1  1.241890
4     5        0  0.053405  -2.980818      -8.100795           1  1.378076
5     6        0  0.056276  -2.861042      -8.100795           1  1.152199
6     7        0  0.058131  -4.391345      -8.100795           1  2.246382
7     8        0  0.062418  -3.057262      -8.100795           1  0.531615
8     9        0  0.065270  -4.404635      -8.100795           1 -0.251702
9    10        0  0.067542  -3.965679      -8.100795           1  2.061980

[10 rows x 7 columns]

[]

Do those values of x look good? The values look sane to me, but I can't eyeball a table of numbers and tell whether they are distributed properly. Let's get more data and plot it:

define more_data = run(accumulate_dataset(100, do(posterior, collect(x))))

<venture.engine.inference.Dataset object at 0x2b9d8e7282d0>

plot("h0", more_data)

[]

A Venture plot

That h0 says "make a histogram of the zeroth collected data stream". You will eventually want to study the plot spec reference, because that explains how to plot any of the values or metadata collected in the dataset on the x, y, or color axis of a plot.

Here's a version with more samples for better resolution:

plot("h0", run(accumulate_dataset(500, do(posterior, collect(x)))))

[]

A Venture plot

The analytical answer is a Gaussian with mean 1 and precision 2 (which is standard deviation 1/sqrt(2)), whose histogram looks like this:

A Gaussian with a lot of samples

Exercise: How visually close can you get to this plot by increasing the number of iterations (the 500) in the above expression? Is inference working? Are you satisfied with how fast it is?

In the next segment, we will look at a few different inference strategies, which give different speed-accuracy tradeoffs.

Inference Choices

In this segment, we will look at a few different generic inference strategies that are built in to VentureScript and do some basic exploration of their speed-accuracy behavior. As a reminder, here is the model we are working with (you can skip this block if you have your session from the last segment):

clear;
assume x = normal(0, 1);
assume y = normal(x, 1);
observe y = 2

If you're curious why we're looking at the performance of inference on such a simple model, there are a couple reasons:

There is an analytic solution, which means the correct result is available for comparison.
The Gaussian distribution is actually a pretty good proxy for general probability problems: the mean corresponds to "the true result" and the standard deviation corresponds to "the uncertainty" or "the error tolerance".
The position of the observation is a proxy for the "difficulty" of the problem: the further out the data are from the prior, the more inference needs to overcome in order to reach the right answer.
The one down side is that this problem in one-dimensional, so you need to exercise active skepticism around your geometric intuitions: try and cross-check the mental pictures you build for how they would work in 2, 3, 10, or 10,000 dimensions.

Rejection sampling does not solve all problems

The inference command posterior is global rejection sampling -- guessing and checking. As you might know or imagine, this is likely to get slow. In particular, let's fiddle with the position of the observation and see what happens as we change the model to make the problem harder.

If you like standalone scripts, save the following into a file named, say, exact.vnts

assume x = normal(0, 1);
assume y = normal(x, 1);
observe y = 2;
plot("h0",
 run(
  accumulate_dataset(100,
   do(posterior,
      collect(x)))))

and explore by editing it and rerunning it with

$ venture -f exact.vnts

If you prefer to fiddle interactively, VentureScript has a command named forget for removing parts of a model so they can be changed. First, check what's in the model already:

list_directives

29: [assume x (normal 0 1)]:  -1.64830416955
30: [assume y (normal x 1)]:    2.0
31: [observe y 2]

[Note: The above output is presented in a prefix notation syntax that represents how VentureScript actually parsed it.]

Then use forget to remove the existing observation, and replace it with one where the datum is farther out.

forget 31;
observe normal(x,1) = 4

list_directives

29: [assume x (normal 0 1)]:  -1.64830416955
30: [assume y (normal x 1)]:    2.0
32: [observe (normal x 1) 4]

The same solution still works

plot("h0", run(accumulate_dataset(100, do(posterior, collect(x)))))

[]

A Venture plot

but as you move the datum still further out (with more forget and observe commands), you can tell that more distant observations take longer to get samples for. Don't move the datum too far: rejection takes e^(O(position^2)) time on this problem.

The inference phenomenon at work here is this:

The KL divergence from the posterior to the prior is one notion of the "distance" between the posterior and the prior. Intuitively, we can think of this KL divergence as measuring the difficulty of the problem. In the current example, the KL divergence grows quadratically with position of the datum.
Rejection sampling (which is the algorithm the posterior command uses) is exponentially slow in the KL gap it is trying to cross. This is one case in which the intuitive notion that KL divergence measures the difficulty of inference has been made precise.

Importance Sampling with Resampling

A very common inference technique is importance sampling with resampling. It is:

Draw some number of trials (or "particles") from the prior [or any proposal distribution of your choice]
Weight them by the likelihood [or the ratio between the posterior and proposal probabilities]
Select one at random in proportion to its weight [or more than one, but then they are not independent of each other]

Here's what a basic version of this looks like in VentureScript. Reset the model if you need to

clear;
assume x = normal(0, 1);
assume y = normal(x, 1);
observe y = 4

or load your file from the rejection segment into an interactive session with

venture -f exact.vnts -i

Then make 5 particles

infer resample(5)

[]

The resample inference action makes as many particles as requested by drawing from the existing pool proportionally to their weight, and then makes the weights equal (this compound operation is very common in sequential Monte Carlo or particle filter methods). Initially there is just one particle, so right now all of them are the same:

run(sample_all(x))

[0.28061261885259997, 0.28061261885259997, 0.28061261885259997, 0.28061261885259997, 0.28061261885259997]

To do importance sampling, we next reset them to the prior and weight by likelihood:

infer likelihood_weight();
run(sample_all(x))

[-1.086234889062641, 1.3623465562271404, -0.40785009488157614, -0.5804252295537723, 2.5681898258814106]

and then resample down to one, which picks a particle at random in proportion to its weight.

infer resample(1);
sample x

2.56818982588

Using the plotting tools we learned before, here's the distribution this algorithm makes (note the use of do to sequence inference actions):

plot("h0", run(accumulate_dataset(100,
  do(resample(5),
     likelihood_weight(),
     resample(1),
     collect(x)))))

[]

A Venture plot

What is going on here? If you put the observation at 4, then the answer should be a Gaussian with mean 2, but the distribution with 5 trials is not even close (though it has moved from the prior). This is our first approximate inference algorithm, and generates our first time-accuracy tradeoff.

Exercises:

Vary the location of the observation without varying the inference program. Confirm that the runtime does not change, but the quality of the result (relative to the true posterior) does. (show answer)

Answer: With the observation at y = 2

clear;
assume x = normal(0, 1);
assume y = normal(x, 1);
observe y = 2;
plot("h0", run(accumulate_dataset(100,
  do(resample(5),
     likelihood_weight(),
     resample(1),
     collect(x)))))

[]

A Venture plot

and at y = 6

clear;
assume x = normal(0, 1);
assume y = normal(x, 1);
observe y = 6;
plot("h0", run(accumulate_dataset(100,
  do(resample(5),
     likelihood_weight(),
     resample(1),
     collect(x)))))

[]

A Venture plot

Vary the number of trials (the argument to the first resample) for a fixed location of the datum. Confirm that the runtime is asymptotically linear in that, and that the result distribution approaches the true posterior as the number of trials increases. (show answer)

Answer: With 10 trials

clear;
assume x = normal(0, 1);
assume y = normal(x, 1);
observe y = 4;
plot("h0", run(accumulate_dataset(100,
  do(resample(10),
     likelihood_weight(),
     resample(1),
     collect(x)))))

[]

A Venture plot

and with 50 trials

clear;
assume x = normal(0, 1);
assume y = normal(x, 1);
observe y = 4;
plot("h0", run(accumulate_dataset(100,
  do(resample(50),
     likelihood_weight(),
     resample(1),
     collect(x)))))

[]

A Venture plot

What happens if we don't do the second resample at all? Why? (show answer)

Answer: We get samples from the prior, because we are not using the weights.
What happens if we draw more than 1 particle in the second resample? Why? (show answer)
Answer: This is more subtle.
- For a while, drawing more particles should give a better-resolution plot of the distribution, since we have more data points overall.
- Eventually, though, the fact that they are not independently distributed (but coupled through the original set) will begin to dominate, and the benefit will trail off.
- The quality of the approximation is limited by the distribution of particles from the first resample; if you only increase the number of particles in the second resample while leaving the first resample fixed, the result will not converge to the true posterior.

Basic Metropolis-Hastings Markov Chains

Both rejection sampling and importance sampling consist of guessing possible explanations of the data, without taking advantage of previous guesses when forming new ones. The other major class of inference algorithms tries to take advantage of what has already been learned by retaining a "current state" and (stochastically) evolving it by trying various adjustments in some way arranged to approach the posterior in distribution. Broadly, such algorithms are called Markov chains. (The Markovness is that the "current state" holds all the information about the chain's history that is relevant to determining its future evolution.)

VentureScript has extensive facilities for constructing Markov chain inference strategies. In this segment, we will explore the behavior of a simple generic one on the single-variable normal-normal problem we have been studying.

Reset the model if you need to

clear;
assume x = normal(0, 1);
assume y = normal(x, 1);
observe y = 4;
sample x

0.566915801912

The inference command default_markov_chain(1) takes one step of VentureScript's default Markov chain, possibly changing the value of x.

infer default_markov_chain(1);
sample x

0.780106848349

If we do that a few more times,

infer default_markov_chain(1);
sample x

0.780106848349

infer default_markov_chain(1);
sample x

0.780106848349

infer default_markov_chain(1);
sample x

0.780106848349

infer default_markov_chain(1);
sample x

0.780106848349

we should see x moving (perhaps haltingly and non-monotonically) toward regions of high posterior mass.

(Aside: Internally, Venture's default Markov chain uses an instance of an algorithm template called Metropolis-Hastings, which examines random choices made during the program execution, samples new proposed values for those choices, and decides whether to accept a proposed value based on a ratio of likelihoods.)

As usual, let's see that with more data:

define chain_history = run(accumulate_dataset(500, do(default_markov_chain(1), collect(x))))

<venture.engine.inference.Dataset object at 0x2b9d95a49950>

plot("lc0", chain_history)

[]

A Venture plot

The lc0 says "make a line plot of the 0th collected expression vs the iteration count"; here's the reference.

That was one particular chain. Let's see how their histories vary. First reset the state to the prior

infer reset_to_prior

[]

and let's plot another one:

plot("lc0", run(accumulate_dataset(500, do(default_markov_chain(1), collect(x)))))

[]

A Venture plot

For more insight, let's plot several independent chains together. If we don't use the weights for anything, VentureScript's particles will serve to run several independent chains (the r in the plot specification means color the lines by particle id).

infer resample(20);
infer reset_to_prior;  // Reset after resampling to get independent initialization
plot("lc0r", run(accumulate_dataset(100, do(default_markov_chain(1), collect(x)))))

[]

A Venture plot

That gives a flavor for how the chains evolve to compute a solution to this problem. We can also more precisely characterize the distribution that obtains after taking some number of steps by plotting the end point rather than the history

infer resample(1);  // Back to 1 particle
plot("h0", run(accumulate_dataset(500,
   do(reset_to_prior,
      default_markov_chain(5),
      collect(x)))))

[]

A Venture plot

Notice that a 5-step chain does not suffice, on this problem, to get to the right answer in distribution. This our second approximation algorithm.

Exercises:

Vary the location of the observation without varying the inference program. Confirm that the runtime of a fixed-length chain does not change, but the quality of the result (relative to the true posterior) does. (show answer)

Answer: With the observation at y = 2

clear;
assume x = normal(0, 1);
assume y = normal(x, 1);
observe y = 2;
plot("h0", run(accumulate_dataset(500,
  do(reset_to_prior,
     default_markov_chain(5),
     collect(x)))))

[]

A Venture plot

and at y = 6

clear;
assume x = normal(0, 1);
assume y = normal(x, 1);
observe y = 6;
plot("h0", run(accumulate_dataset(500,
  do(reset_to_prior,
     default_markov_chain(5),
     collect(x)))))

[]

A Venture plot

Vary the number of transitions for a fixed location of the datum. Confirm that the runtime is asymptotically linear in that, and that the result distribution approaches the true posterior as the number of steps increases. (show answer)

Answer: With 10 transitions

clear;
assume x = normal(0, 1);
assume y = normal(x, 1);
observe y = 4;
plot("h0", run(accumulate_dataset(500,
  do(reset_to_prior,
     default_markov_chain(10),
     collect(x)))))

[]

A Venture plot

and with 50 transitions

clear;
assume x = normal(0, 1);
assume y = normal(x, 1);
observe y = 4;
plot("h0", run(accumulate_dataset(500,
  do(reset_to_prior,
     default_markov_chain(50),
     collect(x)))))

[]

A Venture plot

What happens if we don't reset_to_prior? (show answer)

Answer: Now the histogram includes successive states of one long Markov chain. Most of those data points will be individually distributed according to the posterior, but if the number of steps between them is too small, they will not be independent.
Increase both the location of the observation (making the problem harder) and the number of transitions (doing more work). How fast does the latter have to move to keep up with the former? What does the history of a long chain on a hard observation look like? (show answer)
Answer: The further the chain gets, the longer it will sit still before advancing.
```
clear;
assume x = normal(0, 1);
assume y = normal(x, 1);
observe y = 6;
plot("lc0", run(accumulate_dataset(500, do(default_markov_chain(1), collect(x)))))
```
```
[]
```
The reason this happens is that the default Markov chain proposes new candidate values from the prior, independently of the current state. If most of the prior mass has lower posterior probability than the current state, the chain will only rarely make progress. In this case, the probability of proposing a new state larger than the current one decreases exponentially, but at least it's not the squared exponential of rejection sampling.

Optional Subunit: Custom Metropolis-Hastings Proposals

Although the default Markov chain works well for a number of problems, and is noticeably more tractable than rejection sampling, it still suffers from the problem that it has difficulty proposing good moves when the posterior is far from the prior (see the last exercise in the previous section).

One approach that can improve performance on these problems is to take advantage of the fact that the current state of the Markov chain is probably better than an average sample from the prior. This suggests that a Markov chain that proposes transitions that are close to the current state may do better than one that proposes from the prior every time (as default_markov_chain does).

VentureScript has an experimental library of procedures for constructing Metropolis-Hastings proposals from user-supplied transition functions. The proposals generated by these procedures are automatically weighted in favor of more probable values so that the resulting Markov chain is probabilistically correct, in that it targets the true posterior as its stationary distribution.

We can use these features to implement custom Markov chain inference strategies. Let's revisit our model once again, this time with the observation quite far away from 0:

clear;
assume x = normal(0, 1);
assume y = normal(x, 1);
observe y = 10;

Since the observation is so far from the prior, default_markov_chain will have a hard time drawing an x from the prior that is more likely than its current value. Instead, let's have our Markov chain propose values that are centered around the current value of x:

define drift_kernel = proc(x) { normal(x, 1) };

define my_markov_chain =
  mh_correct(
   on_subproblem(default, all,
     symmetric_local_proposal(drift_kernel)));

{'fields': (<venture.lite.sp.VentureSPRecord object at 0x2b9d8e3f1690>,), 'tag': 'inference_action', 'type': 'record'}

The important part of the above code snippet is drift_kernel, which is where we say that at each step of our Markov chain, we'd like to propose a transition by sampling a new state from a unit normal distribution whose mean is the current state. The rest of the code is scaffolding that applies the Metropolis-Hastings correction and turns it into an inference action that we can run; if you'd like to understand it in detail, refer to the documentation of mh_correct, on_subproblem, and symmetric_local_proposal.

Let's give our new Markov chain a whirl:

infer my_markov_chain;
sample x

2.57932657525

infer my_markov_chain;
sample x

2.57932657525

infer repeat(10, my_markov_chain);
sample x

5.8172762353

Looks promising... now let's plot the distribution it makes, as we have been doing:

plot("h0", run(accumulate_dataset(100,
   do(reset_to_prior,
      repeat(10, my_markov_chain),
      collect(x)))));

[]

A Venture plot

Exercises:

Vary the number of transitions in the above plot command. How many transitions does it take for the result distribution to get reasonably close to the posterior (which should be centered around 5 with standard deviation 1/sqrt(2))?

(show answer)
Answer: Shown below are plots after 30 and 60 transitions, respectively, which suggest that the chain has basically converged after 30 transitions.
```
plot("h0", run(accumulate_dataset(100,
  do(reset_to_prior,
     repeat(30, my_markov_chain),
     collect(x)))));
```
```
[]
```
```
plot("h0", run(accumulate_dataset(100,
  do(reset_to_prior,
     repeat(60, my_markov_chain),
     collect(x)))));
```
```
[]
```
Run a few independent chains for 500 transitions and plot the chain histories. Compare this with 500 transitions of the default Markov chain. (Remember to reset_to_prior before each of these.) Which one is more effective at moving toward and exploring the posterior? (show answer)
Answer: Here is our Markov chain
```
infer reset_to_prior;
plot("lc0", run(accumulate_dataset(500, do(my_markov_chain, collect(x)))))
```
```
[]
```
and here is default_markov_chain:
```
infer reset_to_prior;
plot("lc0", run(accumulate_dataset(500, do(default_markov_chain(1), collect(x)))))
```
```
[]
```
Notice that our Markov chain finds the posterior much more quickly and accepts many more transitions, while the default chain gets stuck for long periods of time without moving.
Can you think of any downsides of this approach? (Hint: are there any types of posterior distributions that might give it trouble?) (show answer)

Answer: If the posterior has multiple modes, drift proposals will tend to get stuck in one of the modes, since transitioning from one mode to another requires making a large step.
(Open-ended) Experiment with different transition kernels. What happens if you adjust the variance of drift_kernel? What if instead of a normal distribution you use a uniform distribution centered around the current state, or a heavy-tailed distribution like Student's t distribution? (Hint: uniform_continuous, student_t. Note that if you redefine drift_kernel, you'll have to re-paste the definition of my_markov_chain above.)

< Prev: Top

^ Top

Next: More VentureScript: Linear Regression >

Basic VentureScript

Contents

Getting Started

First Steps

Basic Modeling

Better Inspection

Inference Choices

Rejection sampling does not solve all problems

Importance Sampling with Resampling

Basic Metropolis-Hastings Markov Chains

Optional Subunit: Custom Metropolis-Hastings Proposals