Basic VentureScript

In this chaper, we will go through the basic, foundational concepts in VentureScript: sampling, models, and automatic inference. The practice model will intentionally be very simple, so that it's possible to understand very clearly what's happening.

Basic VentureScript
Getting Started
First Steps
Basic Modeling
Better Inspection
Inference Choices

Getting Started

Start a Venture session by running

$ venture

venture[script] >

This command starts an interactive interpreter for VentureScript. In general, this tutorial will show interactions in grey blocks as above, with the user input in roman font, and the system's output in italics.

You can also run VentureScript scripts by making a file with a .vnts extension and running

$ venture -f script.vnts

N.B.: The command line reference explains what's happening here and what other options venture accepts.

First Steps

Sample an expression

uniform_continuous(0, 1)

0.170949604174

These are independent samples from the unit uniform distribution

uniform_continuous(0, 1)

0.499669635632

We can introduce a variable:

define mu = expon(2)  // Exponential distribution with rate 2

0.991628049245

whose value is a particular sample from that distribution:

mu

0.991628049245

and we can sample further expressions conditioned on that value:

normal(mu, 0.2)

1.04669478244

Exercise: Convince yourself that, for any given mu, normal(mu, 0.2) is a different distribution from

normal(expon(2), 0.2)

0.0662439381403

because the latter samples a new internal mu every time. Drawing just a few samples from each should suffice.

You can find the list of distributions (and deterministic functions) available in VentureScript in the appropriate section of the reference manual.

Now clear your session (or quit and restart Venture) for the next segment:

clear

Basic Modeling

When we used define to make a mu variable in the previous segment, that variable's value was fixed to a particular sample from its generating distribution for the remainder of the program.

In order to make variables that we can learn things about by observation and reasoning, we need to flag them as uncertain, using assume:

assume x = normal(0, 1)

-1.33397071759

Now x is a normally distributed random variable, and its current value is a sample from that distribution.

Assumed and defined variables live in separate namespaces in VentureScript: define goes into the toplevel namespace, and assume goes in the model namespace. Bare expressions are evaluated in the toplevel namespace, so just typing x gives an error:

*** evaluation: Cannot find symbol 'x'
(autorun x)
         ^

To access the current value of an assumed variable, we use sample

sample x

-1.33397071759

because sample runs its expression in the model namespace (without altering the content of the model).

Note that repeated access to x gives the same value each time:

sample x + x

-2.66794143518

That's what allows us to form models with conditional dependencies, such as this extension:

assume y = normal(x, 1)

-2.2372525874

So, why is assume any more useful than define? Models, unlike the top level, allow us to register observations:

observe y = 2

[-6.476618906076881]

sample y

2.0

Note: the value of x doesn't change yet --

sample x

-1.33397071759

-- all we did was register an observation.

Now infer.

infer conditional

[2.0]

Now x is a sample from the conditional distribution p(x|y=2) defined by our assumptions and observations:

sample x

1.78362017087

If we do no more inference, x doesn't change.

sample x

1.78362017087

But we can compute a new conditional sample by running the conditional inferrer again.

infer conditional;
sample x

1.45623960677

Better Inspection

As a reminder, we were studying this model (you can skip this block if you have your session from the last segment):

clear;
assume x = normal(0, 1);
assume y = normal(x, 1);
observe y = 2

[-2.377410879346039]

We can get samples from inference

infer conditional;
sample x

-0.66942482291

but are they actually from the right distribution? We could eyeball it like this a bit more:

infer conditional;
sample x

1.31132449577

infer conditional;
sample x

0.742505880247

but that will get old quickly.

It's better to gather data programmatically, for example like this:

define data = run(accumulate_dataset(10, do(conditional, collect(x))))

<venture.engine.inference.Dataset object at 0x2ab645548c90>

If you really want to know what's going on in this expression you can study the documentation of do, run, accumulate_dataset, and collect. The effect is to gather up the values of x (and some metadata) over 10 iterations of the conditional inference command.

print(data)

   iter  prt. id  time (s)  prt. log wgt.  prt. prob.         x
0     1        0  0.089539      -2.377411           1  1.051775
1     2        0  0.103293      -2.377411           1  0.673414
2     3        0  0.119450      -2.377411           1  0.382025
3     4        0  0.134838      -2.377411           1  0.842510
4     5        0  0.145807      -2.377411           1  2.174216
5     6        0  0.164832      -2.377411           1  2.393941
6     7        0  0.173962      -2.377411           1  1.353022
7     8        0  0.191129      -2.377411           1  0.999126
8     9        0  0.204629      -2.377411           1  1.462554
9    10        0  0.214515      -2.377411           1  1.260449

[]

Do those values of x look good? The values look sane to me, but I can't eyeball a table of numbers and tell whether they are distributed properly. Let's get more data and plot it:

define more_data = run(accumulate_dataset(100, do(conditional, collect(x))))

<venture.engine.inference.Dataset object at 0x2ab6455a0510>

plot("h0", more_data)

[]

A Venture plot

That h0 says "make a histogram of the zeroth collected data stream". You will eventually want to study the plot spec reference, because that explains how to plot any of the values or metadata collected in the dataset on the x, y, or color axis of a plot.

Here's a version with more samples for better resolution:

plot("h0", run(accumulate_dataset(500, do(conditional, collect(x)))))

[]

A Venture plot

The analytical answer is a Gaussian with mean 1 and precision 2 (which is standard deviation 1/sqrt(2)), whose histogram looks like this:

A Gaussian with a lot of samples

Exercise: How visually close can you get to this plot by increasing the number of iterations (the 500) in the above expression? Is inference working? Are you satisfied with how fast it is?

In the next segment, we will look at a few different inference strategies, which give different speed-accuracy tradeoffs.

Inference Choices

In this segment, we will look at a few different generic inference strategies that are built in to VentureScript and do some basic exploration of their speed-accuracy behavior. As a reminder, here is the model we are working with (you can skip this block if you have your session from the last segment):

clear;
assume x = normal(0, 1);
assume y = normal(x, 1);
observe y = 2

[-6.814418318240326]

If you're curious why we're looking at the performance of inference on such a simple model, there are a couple reasons:

There is an analytic solution, which means the correct result is available for comparison.
The Gaussian distribution is actually a pretty good proxy for general probability problems: the mean corresponds to "the true result" and the standard deviation corresponds to "the uncertainty" or "the error tolerance".
The position of the observation is a proxy for the "difficulty" of the problem: the further out the data are from the prior, the more inference needs to overcome in order to reach the right answer.
The one down side is that this problem in one-dimensional, so you need to exercise active skepticism around your geometric intuitions: try and cross-check the mental pictures you build for how they would work in 2, 3, 10, or 10,000 dimensions.

Rejection sampling does not solve all problems

The inference command conditional is global rejection sampling -- guessing and checking. As you might know or imagine, this is likely to get slow. In particular, let's fiddle with the position of the observation and see what happens as we change the model to make the problem harder.

If you like standalone scripts, save the following into a file named, say, exact.vnts

assume x = normal(0, 1);
assume y = normal(x, 1);
observe y = 2;
plot("h0",
 run(
  accumulate_dataset(100,
   do(conditional,
      collect(x)))))

and explore by editing it and rerunning it with

$ venture -f exact.vnts

If you prefer to fiddle interactively, VentureScript has a command named forget for removing parts of a model so they can be changed. First, check what's in the model already:

list_directives

35: x: assume x = normal(0.0, 1.0):   -1.43379666988
37: y: assume y = normal(x, 1.0):   2.0
39: observe y = 2.0

[Note: The above output is presented in a prefix notation syntax that represents how VentureScript actually parsed it.]

Then use forget to remove the existing observation, and replace it with one where the datum is farther out.

forget(39);
observe normal(x,1) = 4

[-15.682011657998093]

list_directives

35: x: assume x = normal(0.0, 1.0):   -1.43379666988
37: y: assume y = normal(x, 1.0):   2.0
42: observe normal(x, 1.0) = 4.0

The same solution still works

plot("h0", run(accumulate_dataset(100, do(conditional, collect(x)))))

[]

A Venture plot

but as you move the datum still further out (with more forget and observe commands), you can tell that more distant observations take longer to get samples for. Don't move the datum too far: rejection takes e^(O(position^2)) time on this problem.

The inference phenomenon at work here is this:

The KL divergence from the conditional (e.g. posterior) to the marginal (e.g. prior) is one notion of the "distance" between the conditional and the marginal. Intuitively, we can think of this KL divergence as measuring the difficulty of the problem. In the current example, the KL divergence grows quadratically with position of the datum.
Rejection sampling (which is the algorithm the conditional command uses) is exponentially slow in the KL gap it is trying to cross. This is one case in which the intuitive notion that KL divergence measures the difficulty of inference has been made precise.

Importance Sampling with Resampling

A very common inference technique is importance sampling with resampling. It is:

Draw some number of trials (or "particles") from the prior [or any proposal distribution of your choice]
Weight them by the likelihood [or the ratio between the posterior and proposal probabilities]
Select one at random in proportion to its weight [or more than one, but then they are not independent of each other]

Here's what a basic version of this looks like in VentureScript. Reset the model if you need to

clear;
assume x = normal(0, 1);
assume y = normal(x, 1);
observe y = 4

[-7.527818178110426]

or load your file from the rejection segment into an interactive session with

venture -f exact.vnts -i

Then make 5 particles

infer resample(5)

[]

The resample inference action makes as many particles as requested by drawing from the existing pool proportionally to their weight, and then makes the weights equal (this compound operation is very common in sequential Monte Carlo or particle filter methods). Initially there is just one particle, so right now all of them are the same:

run(sample_all(x))

[0.3643763547622939, 0.3643763547622939, 0.3643763547622939, 0.3643763547622939, 0.3643763547622939]

To do importance sampling, we next reset them to the prior and weight by likelihood:

infer likelihood_weight();
run(sample_all(x))

[-0.35555465658409324, -0.5476100214718841, 0.816813133309636, 2.0914841728867644, -0.09649421242218251]

and then resample down to one, which picks a particle at random in proportion to its weight.

infer resample(1);
sample x

2.09148417289

Using the plotting tools we learned before, here's the distribution this algorithm makes (note the use of do to sequence inference actions):

plot("h0", run(accumulate_dataset(100,
  do(resample(5),
     likelihood_weight(),
     resample(1),
     collect(x)))))

[]

A Venture plot

What is going on here? If you put the observation at 4, then the answer should be a Gaussian with mean 2, but the distribution with 5 trials is not even close (though it has moved from the prior). This is our first approximate inference algorithm, and generates our first time-accuracy tradeoff.

Exercises:

Vary the location of the observation without varying the inference program. Confirm that the runtime does not change, but the quality of the result (relative to the true posterior) does. (show answer)

Answer: With the observation at y = 2

``` clear; assume x = normal(0, 1); assume y = normal(x, 1); observe y = 2; plot("h0", run(accumulatedataset(100, do(resample(5), likelihoodweight(), resample(1), collect(x)))))

   <pre class="value"><code>[]</code></pre>
   ![A Venture plot](figure-4.png)


   and at y = 6

   ```
   clear;
   assume x = normal(0, 1);
   assume y = normal(x, 1);
   observe y = 6;
   plot("h0", run(accumulate_dataset(100,
      do(resample(5),
         likelihood_weight(),
         resample(1),
         collect(x)))))

[]

Vary the number of trials (the argument to the first resample) for a fixed location of the datum. Confirm that the runtime is asymptotically linear in that, and that the result distribution approaches the true posterior as the number of trials increases. (show answer)

Answer: With 10 trials

``` clear; assume x = normal(0, 1); assume y = normal(x, 1); observe y = 4; plot("h0", run(accumulatedataset(100, do(resample(10), likelihoodweight(), resample(1), collect(x)))))

   <pre class="value"><code>[]</code></pre>
   ![A Venture plot](figure-6.png)


   and with 50 trials

   ```
   clear;
   assume x = normal(0, 1);
   assume y = normal(x, 1);
   observe y = 4;
   plot("h0", run(accumulate_dataset(100,
      do(resample(50),
         likelihood_weight(),
         resample(1),
         collect(x)))))

[]

What happens if we don't do the second resample at all? Why? (show answer)

Answer: We get samples from the prior, because we are not using the weights.
What happens if we draw more than 1 particle in the second resample? Why? (show answer)
Answer: This is more subtle.
- For a while, drawing more particles should give a better-resolution plot of the distribution, since we have more data points overall.
- Eventually, though, the fact that they are not independently distributed (but coupled through the original set) will begin to dominate, and the benefit will trail off.
- The quality of the approximation is limited by the distribution of particles from the first resample; if you only increase the number of particles in the second resample while leaving the first resample fixed, the result will not converge to the true posterior.

Basic Metropolis-Hastings Markov Chains

Both rejection sampling and importance sampling consist of guessing possible explanations of the data, without taking advantage of previous guesses when forming new ones. The other major class of inference algorithms tries to take advantage of what has already been learned by retaining a "current state" and (stochastically) evolving it by trying various adjustments in some way arranged to approach the posterior in distribution. Broadly, such algorithms are called Markov chains. (The Markovness is that the "current state" holds all the information about the chain's history that is relevant to determining its future evolution.)

VentureScript has extensive facilities for constructing Markov chain inference strategies. In this segment, we will explore the behavior of a simple generic one on the single-variable normal-normal problem we have been studying.

Reset the model if you need to

clear;
assume x = normal(0, 1);
assume y = normal(x, 1);
observe y = 4;
sample x

0.933485826276

The inference command default_markov_chain(1) takes one step of VentureScript's default Markov chain, possibly changing the value of x.

infer default_markov_chain(1);
sample x

0.933485826276

If we do that a few more times,

infer default_markov_chain(1);
sample x

1.47922146876

infer default_markov_chain(1);
sample x

1.47922146876

infer default_markov_chain(1);
sample x

1.47922146876

infer default_markov_chain(1);
sample x

1.47922146876

we should see x moving (perhaps haltingly and non-monotonically) toward regions of high posterior mass.

(Aside: Internally, Venture's default Markov chain uses an instance of an algorithm template called Metropolis-Hastings, which examines random choices made during the program execution, samples new proposed values for those choices, and decides whether to accept a proposed value based on a ratio of likelihoods.)

As usual, let's see that with more data:

define chain_history = run(accumulate_dataset(500, do(default_markov_chain(1), collect(x))))

<venture.engine.inference.Dataset object at 0x2ab64fb5a410>

plot("lc0", chain_history)

[]

A Venture plot

The lc0 says "make a line plot of the 0th collected expression vs the iteration count"; here's the reference.

That was one particular chain. Let's see how their histories vary. First reset the state to the prior

infer reset_to_prior

[]

and let's plot another one:

plot("lc0", run(accumulate_dataset(500, do(default_markov_chain(1), collect(x)))))

[]

A Venture plot

For more insight, let's plot several independent chains together. If we don't use the weights for anything, VentureScript's particles will serve to run several independent chains (the r in the plot specification means color the lines by particle id).

infer resample(20);
infer reset_to_prior;  // Reset after resampling to get independent initialization
plot("lc0r", run(accumulate_dataset(100, do(default_markov_chain(1), collect(x)))))

[]

A Venture plot

That gives a flavor for how the chains evolve to compute a solution to this problem. We can also more precisely characterize the distribution that obtains after taking some number of steps by plotting the end point rather than the history

infer resample(1);  // Back to 1 particle
plot("h0", run(accumulate_dataset(500,
   do(reset_to_prior,
      default_markov_chain(5),
      collect(x)))))

[]

A Venture plot

Notice that a 5-step chain does not suffice, on this problem, to get to the right answer in distribution. This our second approximation algorithm.

Exercises:

Vary the location of the observation without varying the inference program. Confirm that the runtime of a fixed-length chain does not change, but the quality of the result (relative to the true posterior) does. (show answer)

Answer: With the observation at y = 2

``` clear; assume x = normal(0, 1); assume y = normal(x, 1); observe y = 2; plot("h0", run(accumulatedataset(500, do(resettoprior, defaultmarkov_chain(5), collect(x)))))

   <pre class="value"><code>[]</code></pre>
   ![A Venture plot](figure-12.png)


   and at y = 6

   ```
   clear;
   assume x = normal(0, 1);
   assume y = normal(x, 1);
   observe y = 6;
   plot("h0", run(accumulate_dataset(500,
      do(reset_to_prior,
         default_markov_chain(5),
         collect(x)))))

[]

Vary the number of transitions for a fixed location of the datum. Confirm that the runtime is asymptotically linear in that, and that the result distribution approaches the true posterior as the number of steps increases. (show answer)

Answer: With 10 transitions

``` clear; assume x = normal(0, 1); assume y = normal(x, 1); observe y = 4; plot("h0", run(accumulatedataset(500, do(resettoprior, defaultmarkov_chain(10), collect(x)))))

   <pre class="value"><code>[]</code></pre>
   ![A Venture plot](figure-14.png)


   and with 50 transitions

   ```
   clear;
   assume x = normal(0, 1);
   assume y = normal(x, 1);
   observe y = 4;
   plot("h0", run(accumulate_dataset(500,
      do(reset_to_prior,
         default_markov_chain(50),
         collect(x)))))

[]

What happens if we don't reset_to_prior? (show answer)

Answer: Now the histogram includes successive states of one long Markov chain. Most of those data points will be individually distributed according to the posterior, but if the number of steps between them is too small, they will not be independent.
Increase both the location of the observation (making the problem harder) and the number of transitions (doing more work). How fast does the latter have to move to keep up with the former? What does the history of a long chain on a hard observation look like? (show answer)

Answer: The further the chain gets, the longer it will sit still before advancing.

``` clear; assume x = normal(0, 1); assume y = normal(x, 1); observe y = 6; plot("lc0", run(accumulatedataset(500, do(defaultmarkov_chain(1), collect(x)))))

   <pre class="value"><code>[]</code></pre>
   ![A Venture plot](figure-16.png)


   The reason this happens is that the default Markov chain proposes
   new candidate values from the prior, independently of the current
   state.  If most of the prior mass has lower posterior probability
   than the current state, the chain will only rarely make
   progress. In this case, the probability of proposing a new state
   larger than the current one decreases exponentially, but at least
   it's not the squared exponential of rejection sampling.
   </div>

Optional Subunit: Custom Metropolis-Hastings Proposals
------------------------------------------------------

Although the default Markov chain works well for a number of problems,
and is noticeably more tractable than rejection sampling, it still
suffers from the problem that it has difficulty proposing good moves
when the posterior is far from the prior (see the last exercise in the
previous section).

One approach that can improve performance on these problems is to take
advantage of the fact that the current state of the Markov chain is
probably better than an average sample from the prior. This suggests
that a Markov chain that proposes transitions that are close to the
current state may do better than one that proposes from the prior
every time (as `default_markov_chain` does).

VentureScript has an experimental library of procedures for
constructing Metropolis-Hastings proposals from user-supplied
transition functions. The proposals generated by these procedures are
automatically weighted in favor of more probable values so that the
resulting Markov chain is probabilistically correct, in that it
targets the true posterior as its stationary distribution.

We can use these features to implement custom Markov chain inference
strategies. Let's revisit our model once again, this time with the
observation quite far away from 0:

clear; assume x = normal(0, 1); assume y = normal(x, 1); observe y = 10;

<pre class="value"><code>[-52.70184979167487]</code></pre>


Since the observation is so far from the prior, `default_markov_chain`
will have a hard time drawing an `x` from the prior that is more
likely than its current value.  Instead, let's have our Markov chain
propose values that are centered around the current value of `x`:

define drift_kernel = proc(x) { normal(x, 1) };

define mymarkovchain = mhcorrect( onsubproblem(default, all, symmetriclocalproposal(drift_kernel)));

<pre class="value"><code>OrderedDict([('type', 'record'), ('tag', 'inference_action'), ('fields', (&lt;venture.lite.sp.VentureSPRecord object at 0x2ab64d2dcd50&gt;,))])</code></pre>


The important part of the above code snippet is `drift_kernel`, which
is where we say that at each step of our Markov chain, we'd like to
propose a transition by sampling a new state from a unit normal
distribution whose mean is the current state. The rest of the code is
scaffolding that applies the Metropolis-Hastings correction and turns
it into an inference action that we can run; if you'd like to
understand it in detail, refer to the documentation of
[`mh_correct`](../../reference/inference.html#mh_correct),
[`on_subproblem`](../../reference/inference.html#on_subproblem), and
[`symmetric_local_proposal`](../../reference/inference.html#symmetric_local_proposal).

Let's give our new Markov chain a whirl:

infer mymarkovchain; sample x

<pre class="value"><code>-0.176729460732</code></pre>

infer mymarkovchain; sample x

<pre class="value"><code>0.215047220398</code></pre>

infer repeat(10, mymarkovchain); sample x

<pre class="value"><code>3.23681045227</code></pre>


Looks promising... now let's plot the distribution it makes, as we
have been doing:

plot("h0", run(accumulatedataset(100, do(resettoprior, repeat(10, mymarkov_chain), collect(x)))));

<pre class="value"><code>[]</code></pre>
![A Venture plot](figure-17.png)


Exercises:

1. Vary the number of transitions in the above plot command.  How many
   transitions does it take for the result distribution to get
   reasonably close to the posterior (which should be centered around
   5 with standard deviation `1/sqrt(2)`)?

   <a href="javascript:toggle('answer-custom-mh-1')">(show answer)</a>

   <div class="answer" id="answer-custom-mh-1" markdown="1">
   Answer: Shown below are plots after 30 and 60 transitions,
   respectively, which suggest that the chain has basically converged
   after 30 transitions.

   ```
   plot("h0", run(accumulate_dataset(100,
      do(reset_to_prior,
         repeat(30, my_markov_chain),
         collect(x)))));

[]

``` plot("h0", run(accumulatedataset(100, do(resettoprior, repeat(60, mymarkov_chain), collect(x)))));

   <pre class="value"><code>[]</code></pre>
   ![A Venture plot](figure-19.png)

   </div>

2. Run a few independent chains for 500 transitions and plot the chain
   histories.  Compare this with 500 transitions of the
   default Markov chain.  (Remember to `reset_to_prior` before each of
   these.)  Which one is more effective at moving toward and exploring
   the posterior?
   <a href="javascript:toggle('answer-custom-mh-2')">(show answer)</a>

   <div class="answer" id="answer-custom-mh-2" markdown="1">
   Answer: Here is our Markov chain

   ```
   infer reset_to_prior;
   plot("lc0", run(accumulate_dataset(500, do(my_markov_chain, collect(x)))))

[]

and here is default_markov_chain:

infer reset_to_prior; plot("lc0", run(accumulate_dataset(500, do(default_markov_chain(1), collect(x)))))

[]

Notice that our Markov chain finds the posterior much more quickly and accepts many more transitions, while the default chain gets stuck for long periods of time without moving.

Can you think of any downsides of this approach? (Hint: are there any types of posterior distributions that might give it trouble?) (show answer)

Answer: If the posterior has multiple modes, drift proposals will tend to get stuck in one of the modes, since transitioning from one mode to another requires making a large step.
(Open-ended) Experiment with different transition kernels. What happens if you adjust the variance of drift_kernel? What if instead of a normal distribution you use a uniform distribution centered around the current state, or a heavy-tailed distribution like Student's t distribution? (Hint: <code>uniform_continuous</code>, <code>student_t</code>. Note that if you redefine drift_kernel, you'll have to re-paste the definition of my_markov_chain above.)

< Prev: Top

^ Top

Next: More VentureScript: Linear Regression >

Basic VentureScript

Contents

Getting Started

First Steps

Basic Modeling

Better Inspection

Inference Choices

Rejection sampling does not solve all problems

Importance Sampling with Resampling

Basic Metropolis-Hastings Markov Chains