Programming and probability: Sampling from a discrete distribution over an infinite set

[[Originally posted at RandomSeed]]

This post is an introductory tutorial which presents a simple algorithm for sampling from a discrete probability distribution $p(k)$ defined over a countably infinite set. We also show how to use higher-order programming to implement the algorithm in a modular and reusable way, and how to amortize the cost of sampling using memoization.

Introduction and motivation

I was recently reading Djalil Chafaï's post on Generating Uniform Random Partitions, which describes an algorithm (originally due to Aart Johannes Stam) for sampling from the uniform law on $\Pi_n$, the set of all partitions of $\lbrace 1, 2, \dots, n \rbrace$. (Chafaï's post and Stam's paper are both highly recommended.)

As a subroutine of the sampling algorithm described by Chafaï, we need to generate a random positive integer $X$, which takes value $k$ with probability $p(k) := k^n/(k!eB_n)$. Here, $B_n$ is the $n^{\rm th}$ Bell number. Searching around, I found that the Python package mpmath provides the bell routine, which computes $B_n$ using Dobinski's formula.

What I could not find was a standard routine for sampling from a discrete distribution over a countably infinite set. Routines such as numpy.random.choice in Python and sample in R accept a probability vector $[p(x_1), \dots, p(x_m)]$ and return a random sample from that distribution. But these routines assume that we can represent the probability distribution as a finite list.

Problem Statement: How can we sample from a probability distribution $p(k)$ which is defined over all positive integers?

The finite case

As a quick review, let’s discuss the standard approach for the case that the distribution has finite support. Let $p$ be a discrete probability distribution over a set $\lbrace x_1, x_2, \dots, x_m \rbrace$ with $m$ items. To generate a sample $X \sim p$:

  1. Compute the probability vector $[p(x_1), \dots, p(x_m)]$.
  2. Compute the cumulative probability vector $[F(x_0), F(x_1), \dots, F(x_m)]$, where $F(x_i) := \sum_{j=1}^{i} p(x_j)$, and $F(x_0) := 0$.
  3. Divide the unit interval [0,1] into intervals $I_j := (F(x_{j-1}), F(x_{j})]$ $(j=1,\dots,m)$.
  4. Sample a uniform random number $U \sim \mathrm{Uniform}[0,1]$.
  5. Find $j$ such that $U \in I_j$.
  6. Return $x_j$.
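
To make steps 1–6 concrete, here is a minimal sketch in python using numpy; the helper name sample_finite and the example probability vector are illustrative choices of mine, not part of any particular library. Note that numpy.searchsorted performs the interval search of step 5 directly on the cumulative vector.

import numpy

def sample_finite(probs, rng):
    """Sample an index j in {1, ..., m} from the probability vector probs."""
    cdf = numpy.cumsum(probs)        # step 2: cumulative probabilities F(x_1), ..., F(x_m)
    u = rng.uniform()                # step 4: U ~ Uniform[0, 1]
    j = numpy.searchsorted(cdf, u)   # step 5: the j with F(x_{j-1}) < U <= F(x_j), 0-indexed
    return j + 1                     # step 6: return the (1-indexed) position of the sampled item

rng = numpy.random.RandomState(1)
print(sample_finite([0.2, 0.5, 0.3], rng))   # prints the index of the sampled item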

Exercise: Prove that this procedure returns a sample $x$ distributed according to $p$.

For more efficient algorithms in the finite case, see this post.

The infinite case

When $m$ is infinite, we cannot hope to construct the vector $[p(x_1), \dots, p(x_m)]$ in memory, or to partition $[0,1]$ into a finite set of intervals $I_j$. However, steps 3–6 are still correct even when $[0,1]$ is partitioned using countably infinitely many $I_j$.

For simplicity and without loss of generality, we define $x_k = k$ for $k = 1,2,\dots$. We assume as given a function pdf(k) which returns $p(k)$ for non-negative integers $k$, with the convention that pdf(0) = 0.

Using the example probability distribution from Chafaï's post, we write a Python function make_pdf which takes an integer parameter n (indicating the set of partitions $\Pi_n$ from which we wish to sample) and returns the corresponding function $k \mapsto k^n/(k!eB_n)$. The rest of the code below implements the simulation procedure described in steps 3–6.

#!/usr/bin/env python
 
import numpy
 
from mpmath import bell
from mpmath import e
from mpmath import factorial
from mpmath import power
 
def make_pdf(n):
    """Return a Dobinski probability p(k) := k^n/(k!eB_n)."""
    def pdf(k):
        numer = power(k, n)
        denom = factorial(k) * e * bell(n)
        return numer / denom
    return pdf
 
def make_cdf(pdf):
    """Return cumulative probability function for pdf."""
    def cdf(k):
        return sum(pdf(j) for j in xrange(k+1))
    return cdf
 
def find_interval(u, cdf):
    """Find k such that u falls in I_k of given cdf."""
    k = 1
    while True:
        (left, right) = (cdf(k-1), cdf(k))
        if left < u <= right:
            return k
        k += 1
 
def simulate(cdf, rng):
    """Simulate from pdf using rng::numpy.random.RandomState."""
    u = rng.uniform()
    return find_interval(u, cdf)

In the code above, note that the only function that is specific to our target distribution $p$ is make_pdf, which returns the “Dobinski” probability density function with a fixed parameter n.

Also note that the functions make_cdf, find_interval, and simulate are written modularly, since each performs a distinct piece of the computation. Moreover, these functions use higher-order programming, a programming pattern in which functions manipulate other functions as values, and so they work with any valid pdf function.
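
Because make_cdf and simulate accept the pdf (or cdf) as a function value, they can be reused verbatim for any distribution over the positive integers, not just the Dobinski one. As a hypothetical illustration (make_geometric_pdf below is my own example, unrelated to the partition-sampling problem):

def make_geometric_pdf(q):
    """Return the geometric probability function p(k) := (1-q)^(k-1) * q, for k = 1, 2, ..."""
    def pdf(k):
        if k < 1:
            return 0.0
        return (1 - q) ** (k - 1) * q
    return pdf

cdf_geom = make_cdf(make_geometric_pdf(0.3))
rng = numpy.random.RandomState(1)
x = simulate(cdf_geom, rng)   # same simulate routine, different target distribution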

Assuming the above code lives in a file called dobinsky.py, let’s generate 100 samples for n = 15 and measure how long it takes:

$ ipython -i dobinsky.py
>>> seed = 1
>>> n_bell = 15
>>> pdf_bell = make_pdf(n_bell)
>>> cdf_bell = make_cdf(pdf_bell)
>>> rng = numpy.random.RandomState(seed)
>>> %timeit -n1 -r1 print [simulate(cdf_bell, rng) for i in xrange(100)]
[5, 7, 2, 5, 4, 4, 5, 5, 5, 6, 5, 6, 5, 8, 3, 6, 5, 6, 4, 5, 7, 9, 5,
6, 8, 8, 4, 4, 4, 8, 4, 5, 8, 6, 6, 5, 6, 7, 3, 7, 10, 7, 5, 7, 4, 6,
8, 5, 5, 4, 3, 6, 5, 5, 6, 4, 6, 4, 6, 7, 4, 5, 6, 5, 4, 6, 6, 6, 8,
6, 8, 4, 4, 7, 5, 4, 8, 5, 7, 7, 8, 6, 7, 5, 5, 8, 5, 9, 6, 6, 4, 8,
6, 6, 5, 5, 8, 6, 3, 6]
1 loop, best of 1: 4.98 s per loop

It took around 4.98 seconds to generate 100 samples, or roughly 50 ms per sample. Can we improve the runtime of the sampler?

Reusing computation with memoization

Assuming that memory is plentiful and time is expensive (which is true in many scientific computing pipelines), we can speed up the simulation procedure using a technique called memoization, which is “an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls and returning the cached result when the same inputs occur again.”

In our program, the key expensive computations are:

  1. pdf, which computes Bell numbers, factorials, and powers.
  2. cdf, which computes a sum whose length grows linearly in its argument.
  3. find_interval, which finds the interval $k$ that a given sample $u$ falls in.

We will memoize pdf and cdf. Let us first write a higher-order function memoize, which takes any function f of a single, hashable input and returns its memoized version f_mem:

def memoize(f):
    cache = dict()
    def f_mem(k):
        if k not in cache:
            cache[k] = f(k)
        return cache[k]
    return f_mem

The returned function f_mem carries the cache dictionary in its own environment; this programming pattern is known as a closure. Every time f_mem is invoked on k, it first checks cache to potentially avoid recomputing f(k); if there is no hit, it populates cache[k] with the result and returns it.
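
As a quick illustration of the caching behavior (slow_square below is a toy function of mine, chosen only to make the cache hits visible), the wrapped function does real work at most once per distinct argument:

def slow_square(k):
    print("computing %d squared" % k)   # visible marker for when real work happens
    return k * k

square_mem = memoize(slow_square)
square_mem(4)   # prints the message, computes and caches 16
square_mem(4)   # cache hit: returns 16 without recomputing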

Back to our ipython session, let’s see how much speedup we can obtain using memoize:

>>> pdf_bell_mem = memoize(pdf_bell)
>>> cdf_bell_mem = memoize(make_cdf(pdf_bell_mem))
>>> rng = numpy.random.RandomState(seed)
>>> %timeit -n1 -r1 print [simulate(cdf_bell_mem, rng) for _i in xrange(100)]
[5, 7, 2, 5, 4, 4, 5, 5, 5, 6, 5, 6, 5, 8, 3, 6, 5, 6, 4, 5, 7, 9, 5,
6, 8, 8, 4, 4, 4, 8, 4, 5, 8, 6, 6, 5, 6, 7, 3, 7, 10, 7, 5, 7, 4, 6,
8, 5, 5, 4, 3, 6, 5, 5, 6, 4, 6, 4, 6, 7, 4, 5, 6, 5, 4, 6, 6, 6, 8,
6, 8, 4, 4, 7, 5, 4, 8, 5, 7, 7, 8, 6, 7, 5, 5, 8, 5, 9, 6, 6, 4, 8,
6, 6, 5, 5, 8, 6, 3, 6]
1 loop, best of 1: 31 ms per loop

It took around 31 milliseconds to generate 100 samples, or roughly 310 microseconds per sample. Heuristically, the speedup factor is ~160x, which is not bad for a little bit of work. Note that we would expect the speedup to improve as we invoke the simulator more often, since the caches become more populated:

>>> %timeit -n1 -r1 print [simulate(cdf_bell_mem, rng) for _i in xrange(100)]
[8, 6, 8, 6, 5, 6, 6, 6, 8, 8, 5, 9, 5, 4, 4, 6, 3, 8, 7, 3, 5, 5, 4,
7, 5, 8, 6, 8, 7, 8, 6, 6, 7, 5, 6, 6, 3, 6, 6, 7, 5, 8, 6, 5, 7, 6,
4, 5, 6, 8, 2, 9, 5, 9, 6, 7, 6, 6, 5, 6, 7, 7, 7, 7, 7, 5, 6, 6, 5,
5, 5, 5, 6, 6, 9, 6, 5, 5, 5, 7, 8, 8, 6, 5, 5, 7, 6, 7, 6, 7, 6, 7,
6, 6, 5, 4, 5, 4, 9, 5]
1 loop, best of 1: 15.1 ms per loop

Exercise: Explain why the lists of 100 samples returned by the first two invocations of simulate are identical to one another, while the final invocation returns a different list of samples. Why did we write the runtime comparison code to behave in this way?

Exercise: To memoize cdf we used the memoize function, which is based on a generic memoization strategy. How would you use the properties of cdf to implement a customized memoizer that uses less computation? Explain the pros/cons of this modification.

Exercise: For our application, does it make sense to memoize find_interval in the same way we did for pdf and cdf? Why or why not?

Challenge Problem: The while loop in find_interval performs a linear scan over the intervals (cdf(k-1), cdf(k)] until it finds the interval k that contains u. How would you use memoization so that (after sufficiently many invocations of simulate) we can find the interval in sub-linear time? (Also describe what exactly your procedure is linear/sub-linear in.)

Closing remarks

In this post we have shown how to sample from a probability distribution over a countably infinite set, how to use functional programming abstractions to implement the algorithm in a modular and reusable way, and how to use the memoization technique to amortize the runtime cost. Remember that the decision to memoize a function must be taken with respect to the resource constraints in the application at hand. For system-critical samplers, it is essential to empirically profile the expected runtime versus memory footprint.