Pfeffer 2016: "Probabilistic programming is a way to create systems that help us make decisions in the face of uncertainty."
A graphical model is a probabilistic model for which a graph expresses the conditional dependence structure between random variables.
Two branches of graphical representations of distributions are commonly used:
by Pfeffer, A. Practical Probabilistic Programming. (Manning Publications Co., 2016)
A motivating example by Austin Rochford:
A rare disease is present in one out of one hundred thousand people. A test gives the correct diagnosis 99.9% of the time. What is the probability that a person that tests positive has the disease?
Conditional probability is the probability that one event will happen, given that another event has occurred.
$$ \begin{align*} P(A\ |\ B) & = \textrm{the probability that } A \textrm{ will happen if we know that } B \textrm{ has happened} \\ & = \frac{P(A \textrm{ and } B)}{P(B)}. \end{align*} $$
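To make the definition concrete, here is a quick simulation (a die-roll example of our own, not from the text): the probability of rolling a six, given that the roll is even, should be $\frac{1/6}{1/2} = \frac{1}{3}$.

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # fair six-sided die

even = rolls % 2 == 0
# P(roll == 6 | roll is even) = P(6 and even) / P(even)
p_cond = (rolls[even] == 6).mean()
print(round(p_cond, 3))  # close to 1/3
```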
Our question,
A rare disease is present in one out of one hundred thousand people. A test gives the correct diagnosis 99.9% of the time. What is the probability that a person that tests positive has the disease?
becomes
$$ \begin{align*} P(+) & = 10^{-5} \\ \\ P(\textrm{Test } +\ |\ +) & = 0.999 \\ P(\textrm{Test } -\ |\ -) & = 0.999 \\ \\ P(+\ |\ \textrm{Test } +) & =\ \textbf{?} \end{align*} $$
Bayes' theorem shows how to get from $P(B\ |\ A)$ to $P(A\ |\ B)$.
$P(A\mid B)={\frac {P(B\mid A)\,P(A)}{P(B)}}$
Bayes' Theorem follows directly from the definition of conditional probability.
$$ \begin{align*} P(A\ |\ B) P(B) & = P(A \textrm{ and } B) \\ & = P(B \textrm{ and } A) \\ & = P(B\ |\ A) P(A). \end{align*} $$
Dividing by $P(B)$ yields Bayes' theorem. Each component of Bayes' theorem has a name: the posterior distribution equals the likelihood times the prior (together, the joint distribution), divided by the marginal distribution of the evidence.
$$\color{red}{P(A\ |\ B)} = \frac{\color{blue}{P(B\ |\ A)\ P(A)}}{\color{green}{P(B)}}.$$
For many useful models the marginal distribution of the evidence is hard or impossible to calculate analytically.
The first step below follows from Bayes' theorem, the second step follows from the law of total probability.
$$ \begin{align*} P(+\ |\ \textrm{Test } +) & = \frac{P(\textrm{Test } +\ |\ +) P(+)}{P(\textrm{Test } +)} \\ & = \frac{P(\textrm{Test } +\ |\ +) P(+)}{P(\textrm{Test } +\ |\ +) P(+) + P(\textrm{Test } +\ |\ -) P(-)} \end{align*} $$
$$ \begin{align*} P(+) & = 10^{-5} \\ P(\textrm{Test } +\ |\ +) & = 0.999 \\ P(\textrm{Test } -\ |\ -) & = 0.999 \\ \\ P(+\ |\ \textrm{Test } +) & = \frac{P(\textrm{Test } +\ |\ +) P(+)}{P(\textrm{Test } +\ |\ +) P(+) + P(\textrm{Test } +\ |\ -) P(-)} \\ & = \frac{0.999 \times 10^{-5}}{0.999 \times 10^{-5} + 0.001 \times \left(1 - 10^{-5}\right)} \end{align*} $$
0.999 * 1e-5 / (0.999 * 1e-5 + 0.001 * (1 - 1e-5))
0.009891284975940117
Strikingly, a person that tests positive has a less than 1% chance of actually having the disease! (Worryingly, only 21% of doctors answered this question correctly in a recent study.)
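The analytic result can be sanity-checked by brute-force simulation. The sketch below (sample size chosen arbitrarily) draws a large population, applies the noisy test, and measures the disease rate among those who test positive:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000_000  # many draws needed, since the disease is rare

disease = rng.random(n) < 1e-5       # prevalence 1 in 100,000
correct = rng.random(n) < 0.999      # test reports the true status 99.9% of the time
test_pos = np.where(correct, disease, ~disease)

p_est = disease[test_pos].mean()
print(p_est)  # should be close to the analytic 0.00989...
```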
Sequence of updating posterior probabilities as we observe increasing amounts of data (coin flips).
How does our inference change as we observe more and more data?
example from
Davidson-Pilon, C. Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference. (Addison-Wesley Professional, 2015).
%matplotlib inline
from IPython.core.pylabtools import figsize
import numpy as np
from matplotlib import pyplot as plt
figsize(11, 9)

import scipy.stats as stats

dist = stats.beta
n_trials = [0, 1, 2, 3, 4, 5, 8, 15, 50, 500]
data = stats.bernoulli.rvs(0.5, size=n_trials[-1])
x = np.linspace(0, 1, 100)

# For the already prepared: this uses the Binomial's conjugate prior (Beta).
for k, N in enumerate(n_trials):
    sx = plt.subplot(len(n_trials) // 2, 2, k + 1)
    if k in [0, len(n_trials) - 1]:
        plt.xlabel("$p$, probability of heads")
    plt.setp(sx.get_yticklabels(), visible=False)
    heads = data[:N].sum()
    y = dist.pdf(x, 1 + heads, 1 + N - heads)
    plt.plot(x, y, label="observe %d tosses,\n %d heads" % (N, heads))
    plt.fill_between(x, 0, y, color="#348ABD", alpha=0.4)
    plt.vlines(0.5, 0, 4, color="k", linestyles="--", lw=1)
    leg = plt.legend()
    leg.get_frame().set_alpha(0.4)
    plt.autoscale(tight=True)

plt.suptitle("Bayesian updating of posterior probabilities",
             y=1.02,
             fontsize=14)
plt.tight_layout()
1.1) define priors
= prior knowledge: quantifying the uncertainty in latent variables (what we know about the cause before the measurement / experiment)
e.g. if we know the values lie somewhere around zero we might use $\theta \sim Normal(0, 1)$ (a standard normal)
1.2) likelihood function: the data-generating mechanism
unnormalized form:
$$ p(\theta\ |\ y) \propto p(y\ |\ \theta)\ p(\theta) $$
including the marginal likelihood = model evidence = normalizing factor:
$$ p(\theta\ |\ y) = \frac{p(y\ |\ \theta)\ p(\theta)}{p(y)}, \qquad p(y) = \int p(y\ |\ \theta)\ p(\theta)\ d\theta $$
the numerator is integrated over all of $\theta$ to "integrate out $\theta$"
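For simple one-parameter models the evidence integral can be evaluated numerically. A minimal sketch with hypothetical numbers (a uniform Beta(1, 1) prior on $\theta$ and 7 successes observed in 10 Binomial trials); with a uniform prior the evidence is exactly $1/(n+1)$, which gives a handy check:

```python
from scipy import stats, integrate

y, n = 7, 10  # hypothetical data: 7 successes in 10 trials

def joint(theta):
    # numerator of Bayes' theorem: likelihood * prior
    return stats.binom.pmf(y, n, theta) * stats.beta.pdf(theta, 1, 1)

# marginal likelihood: integrate theta out of the numerator
evidence, _ = integrate.quad(joint, 0, 1)
print(evidence)  # with a uniform prior this is exactly 1/(n+1) ≈ 0.0909
```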
! model outputs and their validity are conditioned on the specification of the model itself
→ model is specified based on many assumptions which are largely unverifiable
Metropolis-Hastings (MH) sampling
Hamiltonian Monte Carlo (HMC)
No U-Turn Sampler (NUTS)
automatic differentiation variational inference (ADVI)
Install PyMC3 with anaconda:
conda install -c conda-forge pymc3
by Pfeffer, A. Practical Probabilistic Programming. (Manning Publications Co., 2016)
Distribution over values:
$$ \begin{align*} & X \sim Normal(\mu, \sigma) \\ & x = X.random(n=100) \end{align*} $$
Distribution over functions:
$$ \begin{align*} & Y \sim GaussianProcess(mean(x), cov(x)) \\ & y = Y.predict(x2) \end{align*} $$
Conditioning variables on each other:
$$ \begin{align*} & p \sim Beta(1, 1) \\ & z \sim Bernoulli(p) \rightarrow z|p \end{align*} $$
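Such a conditioned pair can be simulated by ancestral sampling: draw each variable given its parents. A minimal numpy sketch (plain numpy, not PyMC3 syntax); marginally $P(z = 1) = E[p] = 0.5$ under a Beta(1, 1) prior:

```python
import numpy as np

rng = np.random.default_rng(1)

# ancestral sampling: first p from its prior, then z conditional on p
p = rng.beta(1, 1, size=100_000)   # p ~ Beta(1, 1)
z = rng.random(100_000) < p        # z | p ~ Bernoulli(p)

print(z.mean())  # close to E[p] = 0.5
```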
PyMC3 is concerned with two types of programming variables:
Davidson-Pilon, C. Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference. (Addison-Wesley Professional, 2015).
stochastic variables are variables that are not deterministic, i.e., even if you knew all the values of the variable's parameters and components, its value would still be random. Instances of the classes Poisson, DiscreteUniform, and Exponential fall into this category.
beta_1 = pm.Uniform("beta_1", 0, 1)
beta_2 = pm.Uniform("beta_2", 0, 1)
betas = pm.Uniform("betas", 0, 1, shape=N)
deterministic variables are variables that are not random if their parameters and components are known. This might be confusing at first: a quick mental check is, if I knew all of variable foo's component variables, could I determine foo's value?
var1 = pm.Deterministic('var1', 3)
but also
var1 = beta_1 + beta_2
Load PyMC3 and generate a new model object. All variables of the new model are defined within its context.
with pm.Model() as test_model:  # context management
    parameter = pm.Exponential("poisson_param", 1)
    data_generator = pm.Poisson("data_generator", parameter)
print("• Model initialized")
• Model initialized
Subsequent alterations of a model are applied in the same context:
with test_model:
    data_plus_one = data_generator + 1
print("• Model extended")
• Model extended
Variables defined in a model automatically get an initial test value. It is used as the starting point for sampling in case no other starting value is specified.
print("parameter.tag.test_value =", parameter.tag.test_value)
print("data_generator.tag.test_value =", data_generator.tag.test_value)
print("data_plus_one.tag.test_value =", data_plus_one.tag.test_value)
parameter.tag.test_value = 0.693147177890573
data_generator.tag.test_value = 0
data_plus_one.tag.test_value = 1
An individual starting point can be defined when a variable is added to the model:
with test_model:
    parameter2 = pm.Exponential("poisson_param2", 1, testval=0.5)
print("\nparameter2.tag.test_value =", parameter2.tag.test_value)
parameter2.tag.test_value = 0.49999999904767284
define model:
import pymc3 as pm
## hide warning from Theano
from theano import config
config.warn.round = False
## set configuration
n = 100 # number of coin tosses
h = 66 # absolute number of heads
# shape parameters of the Beta prior; Beta(2, 2) has its maximum at p = 0.5
alpha = 2
beta = 2
niter = 1000 # number of iterations
## define model
with pm.Model() as model:
    # define priors
    p = pm.Beta('p', alpha=alpha, beta=beta)
    # define likelihood
    y = pm.Binomial('y', n=n, p=p, observed=h)
print("• Model initialized")
• Model initialized
print("• run inference")
with model:
    start = pm.find_MAP()   # use MAP estimate (optimization) as the initial state for MCMC
    step = pm.Metropolis()  # have a choice of samplers
    trace = pm.sample(niter, step, start, random_seed=123, progressbar=True)
## plot results
plt.hist(trace['p'], 15, histtype='step', density=True, label='post');
x = np.linspace(0, 1, 100)
plt.plot(x, stats.beta.pdf(x, alpha, beta), label='prior');
plt.legend(loc='best');
• run inference
Optimization terminated successfully.
         Current function value: 3.665377
         Iterations: 5
         Function evaluations: 6
         Gradient evaluations: 6
100%|██████████| 1000/1000 [00:00<00:00, 3533.59it/s]
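Because the Beta prior is conjugate to the Binomial likelihood, the posterior here is known in closed form, which gives a useful check on the Metropolis samples:

```python
from scipy import stats

# conjugacy: Beta(alpha, beta) prior + h heads in n tosses
# gives the posterior Beta(alpha + h, beta + n - h)
n, h = 100, 66
alpha, beta = 2, 2

posterior = stats.beta(alpha + h, beta + n - h)
print(posterior.mean())  # ≈ 0.654; the sampled histogram should match this curve
```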
Davidson-Pilon, C. Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference. (Addison-Wesley Professional, 2015).
Salvatier J, Wiecki TV, Fonnesbeck C. (2016) Probabilistic programming in Python using PyMC3. PeerJ Computer Science 2:e55 https://doi.org/10.7717/peerj-cs.55
To introduce model definition, fitting and posterior analysis, we first consider a simple Bayesian linear regression model with normal priors for the parameters.
$$ Y \sim \mathcal{N}(\mu, \sigma^2) $$
We choose zero-mean normal priors with variance of 100 for both regression coefficients, which corresponds to weak information regarding the true parameter values. We choose a half-normal distribution (normal distribution bounded at zero) as the prior for $\sigma$.
$$ \alpha \sim \mathcal{N}(0, 100) $$ $$ \beta_i \sim \mathcal{N}(0, 100) $$ $$ \sigma \sim \lvert\mathcal{N}(0, 1){\rvert} $$
import numpy as np
import matplotlib.pyplot as plt
# Initialize random number generator
np.random.seed(123)
# True parameter values
α, σ = 1, 1
β = [1, 2.5]
# Size of dataset
size = 100
# Predictor variable
X1 = np.random.randn(size)
X2 = np.random.randn(size) * 0.2
# Simulate outcome variable
Y = α + β[0]*X1 + β[1]*X2 + np.random.randn(size)*σ
print("• random data simulated")
• random data simulated
print("• plot information")
fig, axes = plt.subplots(1, 2, sharex=True, figsize=(10,4))
axes[0].scatter(X1, Y)
axes[1].scatter(X2, Y)
axes[0].set_ylabel('Y'); axes[0].set_xlabel('X1'); axes[1].set_xlabel('X2');
• plot information
linear_model = pm.Model()
with linear_model:
    alpha = pm.Normal('alpha', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=10, shape=2)
    sigma = pm.HalfNormal('sigma', sd=1)
    # Expected value of outcome
    mu = alpha + beta[0]*X1 + beta[1]*X2
    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=Y)
print("• linear model defined")
• linear model defined
map_estimate = pm.find_MAP(model=linear_model)
print(map_estimate)
Optimization terminated successfully.
         Current function value: 149.017982
         Iterations: 16
         Function evaluations: 21
         Gradient evaluations: 21
{'alpha': array(0.9065985497559482), 'beta': array([ 0.94848602,  2.60705514]), 'sigma_log_': array(-0.032781470174030686)}
print("alpha:", map_estimate['alpha'])
print("beta:", map_estimate['beta'])
alpha: 0.9090521898977764
beta: [ 0.95140146  2.61437458]
applying an alternative optimizer:
from scipy import optimize
map_estimate = pm.find_MAP(model=linear_model, fmin=optimize.fmin_powell)
print("\nalpha:", map_estimate['alpha'])
print("beta:", map_estimate['beta'])
Optimization terminated successfully.
         Current function value: 149.019762
         Iterations: 4
         Function evaluations: 176

alpha: 0.9090521898977764
beta: [ 0.95140146  2.61437458]
A simulation-based approach such as Markov chain Monte Carlo (MCMC) can be used to obtain a Markov chain of values that, provided certain conditions are satisfied, are indistinguishable from samples from the posterior distribution.
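To illustrate the core idea (this is a toy sketch, not PyMC3's actual implementation), a random-walk Metropolis sampler targeting a standard normal fits in a few lines: propose a local move, and accept it with probability proportional to the ratio of target densities.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    return -0.5 * x**2  # unnormalized log density of N(0, 1)

x = 0.0
samples = []
for _ in range(50_000):
    proposal = x + rng.normal(scale=1.0)  # symmetric random-walk proposal
    # accept with probability min(1, target(proposal) / target(x))
    if np.log(rng.random()) < log_target(proposal) - log_target(x):
        x = proposal
    samples.append(x)

samples = np.array(samples)
print(samples.mean(), samples.std())  # close to 0 and 1
```

Note that only the unnormalized density is needed, which is exactly why MCMC sidesteps the intractable evidence term.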
with linear_model:
    # obtain starting values via MAP
    start = pm.find_MAP(fmin=optimize.fmin_powell)
    # instantiate sampler
    step = pm.NUTS(scaling=start)
    # draw 2000 posterior samples
    trace = pm.sample(2000, step, start=start)
Optimization terminated successfully.
         Current function value: 149.019762
         Iterations: 4
         Function evaluations: 176
100%|██████████| 2000/2000 [00:06<00:00, 328.55it/s]
tmp = pm.traceplot(trace)
pm.summary(trace)
alpha:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  0.905            0.097            0.002            [0.724, 1.110]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  0.708          0.843          0.905          0.969          1.103

beta:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  0.949            0.089            0.002            [0.778, 1.129]
  2.601            0.499            0.013            [1.653, 3.585]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  0.771          0.893          0.949          1.007          1.124
  1.644          2.246          2.605          2.945          3.577

sigma:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  0.989            0.068            0.002            [0.858, 1.116]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  0.873          0.941          0.984          1.034          1.140
Salvatier J, Wiecki TV, Fonnesbeck C. (2016) Probabilistic programming in Python using PyMC3. PeerJ Computer Science 2:e55 https://doi.org/10.7717/peerj-cs.55
Generalized Linear Models (GLMs) are a class of flexible models that are widely used to estimate regression relationships between a single outcome variable and one or multiple predictors. Because these models are so common, PyMC3 offers a glm submodule that allows flexible creation of various GLMs with an intuitive R-like syntax.
# Convert X and Y to a pandas DataFrame
import pandas
df = pandas.DataFrame({'x1': X1, 'x2': X2, 'y': Y})
from pymc3.glm import glm
with pm.Model() as glm_model:
    glm('y ~ x1 + x2', df)
    trace = pm.sample(5000)
The error distribution, if not specified via the family argument, is assumed to be normal. In the case of logistic regression, this can be modified by passing in a Binomial family object.
from pymc3.glm.families import Binomial
# convert to binary outcome
df_logistic = pandas.DataFrame({'x1': X1, 'y': Y > np.median(Y)})
# define logistic regression model
with pm.Model() as model_glm_logistic:
    glm('y ~ x1', df_logistic, family=Binomial())
size = 200
true_intercept = 1
true_slope = 2
x = np.linspace(0, 1, size)
# y = a + b*x
true_regression_line = true_intercept + true_slope * x
# add noise
y = true_regression_line + np.random.normal(scale=.5, size=size)
data = dict(x=x, y=y)
print("• random data generated")
• random data generated
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, xlabel='x', ylabel='y', title='Generated data and underlying model')
ax.plot(x, y, 'x', label='sampled data')
ax.plot(x, true_regression_line, label='true regression line', lw=2.)
plt.legend(loc=0);
with pm.Model() as model_glm:
    # Define priors
    sigma = pm.HalfCauchy('sigma', beta=10, testval=1.)
    intercept = pm.Normal('Intercept', 0, sd=20)
    x_coeff = pm.Normal('x', 0, sd=20)
    # Define likelihood
    likelihood = pm.Normal('y', mu=intercept + x_coeff * x,
                           sd=sigma, observed=y)
    # Inference
    start = pm.find_MAP()          # find starting value by optimization
    step = pm.NUTS(scaling=start)  # instantiate MCMC sampling algorithm
    trace = pm.sample(2000, step, start=start, progressbar=False)  # draw 2000 posterior samples using NUTS
Optimization terminated successfully.
         Current function value: 161.171367
         Iterations: 15
         Function evaluations: 20
         Gradient evaluations: 20
with pm.Model() as model_glm2:
    # specify the glm and pass in data; the resulting linear model, its likelihood
    # and all its parameters are automatically added to our model
    pm.glm.glm('y ~ x', data)
    # Inference
    start = pm.find_MAP()
    step = pm.NUTS(scaling=start)  # instantiate MCMC sampling algorithm
    trace = pm.sample(2000, step, progressbar=False)  # draw 2000 posterior samples using NUTS
Optimization terminated successfully.
         Current function value: 182.804440
         Iterations: 16
         Function evaluations: 21
         Gradient evaluations: 21
plt.figure(figsize=(7, 7))
pm.traceplot(trace[100:])
plt.tight_layout();
plt.figure(figsize=(7, 7))
plt.plot(x, y, 'x', label='data')
pm.glm.plot_posterior_predictive(trace, samples=100,
label='posterior predictive regression lines')
plt.plot(x, true_regression_line, label='true regression line', lw=3., c='y')
plt.title('Posterior predictive regression lines')
plt.legend(loc=0)
plt.xlabel('x')
plt.ylabel('y');
Gelman et al.'s (2007) radon dataset is a classic for hierarchical modeling. In this dataset the amount of the radioactive gas radon was measured in households across the counties of several states. Radon gas is known to be the leading cause of lung cancer in non-smokers. It is believed to be more strongly present in households with a basement and to vary with the type of soil. Here we will investigate these differences and try to predict radon levels in different counties based on the county itself and the presence of a basement. In this example we look at Minnesota, a state with 85 counties in which measurements were taken, ranging from 2 to 80 measurements per county.