Probabilistic Programming with PyMC3

Florian Endel

florian@endel.at

Overview

  • What is Probabilistic Programming
    • Probabilistic Models
    • Bayesian Method
    • Sampler, Fitting function
    • Outline & elements of a probabilistic programming
  • Introduction to PyMC3
    • Structure, elements, functions
  • Examples
    • Coin toss with PyMC3
    • linear regression (LM)
    • generalized linear models (GLM)
    • multilevel / hierarchical models

Main sources & references

  • FOSSCON 2017: talk by Austin Rochford on PyMC3
  • PyCon 2017: talk by Christopher Fonnesbeck on PyMC3
  • PyMC3 documentation: docs.pymc.io
  • Davidson-Pilon, C. Bayesian methods for hackers: probabilistic programming and Bayesian inference. (Addison-Wesley, 2016).
  • Kruschke, J. K. Doing Bayesian data analysis: a tutorial with R, JAGS, and Stan. (Academic Press, 2015).
  • Pfeffer, A. Practical probabilistic programming. (Manning Publications, Co, 2016).

What is Probabilistic Programming (PP)

  • ... is any language that describes and fits probability models
  • ... languages aim to close the representational gap of probabilistic models
  • ... unifying general purpose programming with probabilistic modeling
  • ... facilitates the application of Bayesian methods

Pfeffer 2016: "Probabilistic programming is a way to create systems that help us make decisions in the face of uncertainty."

  • technology is not new
    • Bayesian, data science
    • e.g. the software WinBUGS from the 1990s: written in Object Pascal, a domain specific language
    • its successor Stan: written in C++, Turing complete
  • new name & approach
  • nothing to do with randomly generated code
  • most important characteristics:
    • stochastic language primitives
    • inference mechanisms

Probabilistic Graphical Model

is a probabilistic model for which a graph expresses the conditional dependence structure between random variables

Two branches of graphical representations of distributions are commonly used:

  • Bayesian networks

  • Markov random fields

by Pfeffer, A. Practical probabilistic programming. (Manning Publications, Co, 2016)

Bayes' theorem

  • Bayes' formula: the only estimator we need
  • use probability distribution to characterize
    • what we know and
    • don’t know about unknown quantities we care about
  • Prior: all information we know about our unknowns before we observe the world or carry out an experiment
  • Posterior: belief after observation/measurement
  • process:
    • take data from the observations = effects = measurements = $y$
    • update the prior probability to get
    • the posterior probability (posterior = after the observation)
    • that is why the posterior is $\theta$ conditioned on $y$: $P(\theta\ |\ y)$

Bayesian Inference

A motivating example by Austin Rochford:

A rare disease is present in one out of one hundred thousand people. A test gives the correct diagnosis 99.9% of the time. What is the probability that a person that tests positive has the disease?

Conditional Probability

Conditional probability is the probability that one event will happen, given that another event has occurred.

$$ \begin{align*} P(A\ |\ B) & = \textrm{the probability that } A \textrm{ will happen if we know that } B \textrm{ has happened} \\ & = \frac{P(A \textrm{ and } B)}{P(B)}. \end{align*} $$

Our question,

A rare disease is present in one out of one hundred thousand people. A test gives the correct diagnosis 99.9% of the time. What is the probability that a person that tests positive has the disease?

becomes

$$ \begin{align*} P(+) & = 10^{-5} \\ \\ P(\textrm{Test } +\ |\ +) & = 0.999 \\ P(\textrm{Test } -\ |\ -) & = 0.999 \\ \\ P(+\ |\ \textrm{Test } +) & =\ \textbf{?} \end{align*} $$

Bayes' Theorem

Bayes' theorem shows how to get from $P(B\ |\ A)$ to $P(A\ |\ B)$.

$P(A\mid B)={\frac {P(B\mid A)\,P(A)}{P(B)}}$

Bayes' Theorem follows directly from the definition of conditional probability.

$$ \begin{align*} P(A\ |\ B) P(B) & = P(A \textrm{ and } B) \\ & = P(B \textrm{ and } A) \\ & = P(B\ |\ A) P(A). \end{align*} $$

Dividing by $P(B)$ yields Bayes' theorem. Each component of Bayes' Theorem has a name. The posterior distribution is equal to the joint distribution divided by the marginal distribution of the evidence.

$$\color{red}{P(A\ |\ B)} = \frac{\color{blue}{P(B\ |\ A)\ P(A)}}{\color{green}{P(B)}}.$$

For many useful models the marginal distribution of the evidence is hard or impossible to calculate analytically.

The first step below follows from Bayes' theorem, the second step follows from the law of total probability.

$$ \begin{align*} P(+\ |\ \textrm{Test } +) & = \frac{P(\textrm{Test } +\ |\ +) P(+)}{P(\textrm{Test } +)} \\ & = \frac{P(\textrm{Test } +\ |\ +) P(+)}{P(\textrm{Test } +\ |\ +) P(+) + P(\textrm{Test } +\ |\ -) P(-)} \end{align*} $$

$$ \begin{align*} P(+) & = 10^{-5} \\ P(\textrm{Test } +\ |\ +) & = 0.999 \\ P(\textrm{Test } -\ |\ -) & = 0.999 \\ \\ P(+\ |\ \textrm{Test } +) & = \frac{P(\textrm{Test } +\ |\ +) P(+)}{P(\textrm{Test } +\ |\ +) P(+) + P(\textrm{Test } +\ |\ -) P(-)} \\ & = \frac{0.999 \times 10^{-5}}{0.999 \times 10^{-5} + 0.001 \times \left(1 - 10^{-5}\right)} \end{align*} $$

In [128]:
0.999 * 1e-5 / (0.999 * 1e-5 + 0.001 * (1 - 1e-5))
Out[128]:
0.009891284975940117

Strikingly, a person that tests positive has a less than 1% chance of actually having the disease! (Worryingly, only 21% of doctors answered this question correctly in a recent study.)

Connection to PP?

  • Outputs from probabilistic programs are always in probabilistic terms
  • we don't get a single value / point estimate but an entire posterior distribution
  • this allows us to provide measures of uncertainty associated with those estimates/results
  • e.g.: what is the probability that the true value is bigger than some value of interest? (see the sketch below)
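For instance, with a fitted model such a question can be answered directly from the posterior samples. A minimal self-contained sketch; the toy model, data and the threshold of zero are illustrative, not from the original slides:

import numpy as np
import pymc3 as pm

# toy question: is the true mean of some noisy measurements bigger than zero?
data = np.array([0.3, 1.1, 0.8, -0.2, 0.9])  # hypothetical measurements

with pm.Model():
    theta = pm.Normal('theta', mu=0, sd=10)                # prior
    obs = pm.Normal('obs', mu=theta, sd=1, observed=data)  # likelihood
    trace = pm.sample(1000)

# the probability that theta exceeds the value of interest is simply
# the fraction of posterior samples above that value
print("P(theta > 0 | data) =", (trace['theta'] > 0).mean())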

Example: Coin-flip

Sequence of updating posterior probabilities as we observe increasing amounts of data (coin flips).

How does our inference change as we observe more and more data?

  • curves: posterior probabilities
  • uncertainty about the true value (e.g. whether the coin is fair) is proportional to the spread (width) of the curve

example from

Davidson-Pilon, C. Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference. (Addison-Wesley Professional, 2015).

In [123]:
%matplotlib inline
from IPython.core.pylabtools import figsize
import numpy as np
from matplotlib import pyplot as plt
figsize(11, 9)

import scipy.stats as stats

dist = stats.beta
n_trials = [0, 1, 2, 3, 4, 5, 8, 15, 50, 500]
data = stats.bernoulli.rvs(0.5, size=n_trials[-1])
x = np.linspace(0, 1, 100)

# For the already prepared, I'm using Binomial's conj. prior.
for k, N in enumerate(n_trials):
    sx = plt.subplot(len(n_trials) // 2, 2, k + 1)  # integer division: subplot indices must be ints
    plt.xlabel("$p$, probability of heads") \
        if k in [0, len(n_trials) - 1] else None
    plt.setp(sx.get_yticklabels(), visible=False)
    heads = data[:N].sum()
    y = dist.pdf(x, 1 + heads, 1 + N - heads)
    plt.plot(x, y, label="observe %d tosses,\n %d heads" % (N, heads))
    plt.fill_between(x, 0, y, color="#348ABD", alpha=0.4)
    plt.vlines(0.5, 0, 4, color="k", linestyles="--", lw=1)

    leg = plt.legend()
    leg.get_frame().set_alpha(0.4)
    plt.autoscale(tight=True)


plt.suptitle("Bayesian updating of posterior probabilities",
             y=1.02,
             fontsize=14)

plt.tight_layout()

How to?

1) encode a probability model

1.1) define priors

  • = prior knowledge: quantifying the uncertainty in latent variables (what we know about the cause before the measurement / experiment)

  • e.g. if we know the values would be somewhere around zero we might use a $\theta \sim Normal(0,1)$ (standard normal)

  • $\theta \sim Normal(0, 100)$ → encoding lack of information, no real preference on what value it is supposed to be
  • $\theta \sim Beta(1, 50)$ → e.g. prevalence of a rare disease (see the sketch below)
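A minimal sketch of how such priors might be written in PyMC3; the variable names are illustrative, and $Normal(0, 100)$ is read here as a variance of 100, i.e. a standard deviation of 10:

import pymc3 as pm

with pm.Model():
    theta_informed = pm.Normal('theta_informed', mu=0, sd=1)  # values expected around zero
    theta_vague = pm.Normal('theta_vague', mu=0, sd=10)       # weak information (variance 100)
    theta_rare = pm.Beta('theta_rare', alpha=1, beta=50)      # e.g. prevalence of a rare disease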

1.2) likelihood function: data generating mechanism

  • conditions the model on the observed data = it is how we update the prior probability
  • e.g. how many people are visiting a website in a period of time? → $x_{visitors} \sim Poisson(\mu)$
  • e.g. height and weight outcomes of patients? → $x \sim Normal(\mu, \sigma^2)$ (see the sketch below)
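A sketch of how such likelihoods might look in PyMC3; the observed arrays and prior choices are hypothetical, the key point is that the observed argument conditions the distribution on the data:

import numpy as np
import pymc3 as pm

visits = np.array([13, 9, 11, 15, 8])         # hypothetical visitor counts
heights = np.array([172., 180., 168., 175.])  # hypothetical measurements

with pm.Model():
    # website visitors in a period of time
    mu_visits = pm.Exponential('mu_visits', 0.1)
    x_visitors = pm.Poisson('x_visitors', mu=mu_visits, observed=visits)

    # height outcomes of patients
    mu_height = pm.Normal('mu_height', mu=170, sd=20)
    sigma_height = pm.HalfNormal('sigma_height', sd=10)
    x_height = pm.Normal('x_height', mu=mu_height, sd=sigma_height, observed=heights)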

2) Infer Values

  • for latent variables, the posterior distributions → results are not values but probability distributions!
  • update prior distribution with likelihood to get a posterior

unnormalized form:

$$ P(\theta\ |\ y) \propto P(y\ |\ \theta)\, P(\theta) $$

including marginal likelihood = model evidence = normalizing factor:

$$ P(\theta\ |\ y) = \frac{P(y\ |\ \theta)\, P(\theta)}{P(y)}, \qquad P(y) = \int P(y\ |\ \theta)\, P(\theta)\, d\theta $$

the numerator is integrated over all of $\theta$ to “integrate out $\theta$”

  • easy for 1 or 2 parameters = an integral over a couple of variables
  • not straightforward (or even analytically possible) for many parameters
  • numerical methods are required
  • probabilistic programming abstracts the “inference procedure” (= calculating marginal likelihood)

3) check model

  • are results reasonable?
  • often overlooked & forgotten

! model outputs and their validity are conditioned on the specification of the model itself

→ model is specified based on many assumptions which are largely unverifiable

  • one method: simulate data from the model and compare it to the data used to fit the model (posterior predictive check, see the sketch below)
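A minimal self-contained sketch of such a check; the toy normal model is illustrative, and pm.sample_ppc is the posterior predictive sampling function in the PyMC3 version used here:

import numpy as np
import pymc3 as pm

y_observed = np.random.normal(loc=1.0, scale=1.0, size=50)  # data used to fit the model

with pm.Model() as check_model:
    mu = pm.Normal('mu', mu=0, sd=10)
    y = pm.Normal('y', mu=mu, sd=1, observed=y_observed)
    trace = pm.sample(1000)

    # simulate replicated data sets from the fitted model
    ppc = pm.sample_ppc(trace, samples=500)

# compare a simple statistic of the simulated data to the observed data
print("observed mean:", y_observed.mean())
print("mean of simulated data:", ppc['y'].mean())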

Calculating Posteriors

  • analytically impossible, numerically challenging
  • rely on approximations
  • several possibilities:
    • MAP: brute-force maximization of the unnormalized posterior → modal value, but no distribution/uncertainty; not really Bayesian
    • Laplace (normal) approximation: assume the posterior is a normal distribution; unrealistic when the posterior is skewed or has strong covariance
    • sampling-based approaches

Markov Chain Monte Carlo

  • de facto standard
  • we are not able to sample independently from posterior because we do not know what it is
  • we can sample in a dependent fashion
  • next value depends partly on the current value of the chain
  • its distribution is expected to be indistinguishable from the posterior distribution
  • several implementations

Metropolis-Hastings (MH) sampling

  • inefficient; the optimal acceptance rate is only ~24%
  • 1st generation MCMC

Hamiltonian Monte Carlo (HMC)

  • a better alternative that abandons random-walk behavior
  • uses gradient information (→ a strength of Theano!)
  • far fewer rejected samples than Metropolis

No U-Turn Sampler (NUTS)

  • extension of HMC
  • autotunes hyper-parameters
  • Hoffman and Gelman 2014

automatic differentiation variational inference (ADVI)

  • chooses approximations automatically

MCMC visualized

MCMC Interactive Gallery

Introduction to PyMC3

  • started in 2003 by Christopher Fonnesbeck
  • PP framework for fitting arbitrary probability models
  • Fits Bayesian statistical models with Markov chain Monte Carlo and other algorithms.
  • Includes a large suite of well-documented statistical distributions.
  • Uses NumPy and Theano for fast numerical computation.
    • Computation optimization and dynamic C compilation
    • Numpy broadcasting and advanced indexing
  • Creates posterior summaries including tables and plots.
  • Several convergence diagnostics and goodness-of-fit procedures are available.
  • Extensible: easily incorporates custom MCMC algorithms and unusual probability distributions.
  • MCMC loops can be embedded in larger programs, and results can be analyzed with the full power of Python.
  • Intuitive model specification syntax; for example, x ~ N(0,1) translates to x = Normal('x', 0, 1) (see the example below)
  • Powerful sampling algorithms such as Hamiltonian Monte Carlo, NUTS
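For example, a minimal sketch of the syntax (in PyMC3 the name string and a model context are required):

import pymc3 as pm

with pm.Model():
    x = pm.Normal('x', mu=0, sd=1)  # x ~ N(0, 1)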

Install PyMC3 with anaconda:

conda install -c conda-forge pymc3

Theano

  • Python library/API
  • specifying & evaluating mathematical expressions using tensors (multi-dimensional arrays)
  • dynamic C code generation
  • !! very efficient automatic symbolic differentiation → fast automatic gradients! (see the small example below)
  • no longer under active development

by Pfeffer, A. Practical probabilistic programming. (Manning Publications, Co, 2016)
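A small illustration of the symbolic differentiation PyMC3 relies on, using plain Theano (the function and values are illustrative):

import theano
import theano.tensor as tt

x = tt.dscalar('x')           # symbolic scalar variable
y = x ** 2                    # symbolic expression
dy_dx = theano.grad(y, x)     # symbolic derivative: 2*x

f = theano.function([x], dy_dx)  # compiled into fast C code
print(f(3.0))                    # -> 6.0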

Stochastic language "primitives"

Distribution over values:

  • do not assign values but probability distributions to variables
  • sample to draw random values

$$ \begin{align*} & X \sim Normal(\mu, \sigma) \\ & x = X.random(n=100) \end{align*} $$
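In PyMC3, a standalone distribution object can be created with .dist() and sampled from directly, roughly mirroring the notation above (a sketch; mu and sigma are placeholder values):

import pymc3 as pm

X = pm.Normal.dist(mu=0., sd=1.)  # distribution object, not attached to a model
x = X.random(size=100)            # draw 100 random values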

Stochastic language "primitives"

Distribution over functions:

  • each random sample/realisation is a random function, not a value

$$ \begin{align*} & Y \sim GaussianProcess(mean(x), cov(x)) \\ & y = Y.predict(x2) \end{align*} $$

Stochastic language "primitives"

Conditioning variables on each other:

  • allows specifying probability models at a high level

$$ \begin{align*} & p \sim Beta(1, 1) \\ & z \sim Bernoulli(p) \rightarrow z|p \end{align*} $$
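In PyMC3 this conditioning is expressed simply by passing one random variable as a parameter of another (a minimal sketch):

import pymc3 as pm

with pm.Model():
    p = pm.Beta('p', alpha=1, beta=1)  # prior on the success probability
    z = pm.Bernoulli('z', p=p)         # z is conditioned on p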

Variables in PyMC3

PyMC3 is concerned with two types of programming variables

  • stochastic
  • deterministic

Davidson-Pilon, C. Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference. (Addison-Wesley Professional, 2015).

stochastic variables are variables that are not deterministic, i.e., even if you knew all the values of the variables' parameters and components, it would still be random. Included in this category are instances of classes Poisson, DiscreteUniform, and Exponential.

  • random distribution
  • work like functions returning a random observation from a defined distribution
  • initialized with name attribute
  • and additional parameters, depending on the type
beta_1 = pm.Uniform("beta_1", 0, 1)
beta_2 = pm.Uniform("beta_2", 0, 1)
  • a vector of variables can be created using the shape argument
betas = pm.Uniform("betas", 0, 1, shape=N)

deterministic variables are variables that are not random if the variables' parameters and components were known. This might be confusing at first: a quick mental check is if I knew all of variable foo's component variables, I could determine what foo's value is.

  • derived from the Deterministic class
  • parameterized by an expression of other variables (see the sketch below)
var1 = pm.Deterministic('var1', beta_1 + beta_2)

but an unnamed (untracked) deterministic variable is created simply by

var1 = beta_1 + beta_2
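Putting both variable types together in a model context (a minimal sketch): a named Deterministic is recorded in the sampling trace, while the bare expression is not.

import pymc3 as pm

with pm.Model():
    beta_1 = pm.Uniform('beta_1', 0, 1)  # stochastic
    beta_2 = pm.Uniform('beta_2', 0, 1)  # stochastic

    # named deterministic variable: stored in the trace under 'beta_sum'
    beta_sum = pm.Deterministic('beta_sum', beta_1 + beta_2)

    # unnamed deterministic variable: usable in the model, but not tracked
    untracked = beta_1 + beta_2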

PyMC3 structure

Load PyMC3 and generate a new model object. All variables in the new model are defined within its context.

In [113]:
import pymc3 as pm

with pm.Model() as test_model: # context management
    parameter = pm.Exponential("poisson_param", 1)
    data_generator = pm.Poisson("data_generator", parameter)
    
print("• Model initialized")
• Model initialized

Subsequent alterations of a model are applied in the same context:

In [114]:
with test_model:
    data_plus_one = data_generator + 1
    
print("• Model extended")
• Model extended

inspect variables

Variables defined in a model automatically get an initial test value. It is used as the starting point for sampling if no other starting value is specified.

In [115]:
print("parameter.tag.test_value =", parameter.tag.test_value)
print("data_generator.tag.test_value =", data_generator.tag.test_value)
print("data_plus_one.tag.test_value =", data_plus_one.tag.test_value)
parameter.tag.test_value = 0.693147177890573
data_generator.tag.test_value = 0
data_plus_one.tag.test_value = 1

An individual starting point can be defined when a variable is added to the model:

In [116]:
with test_model:
    parameter2 = pm.Exponential("poisson_param2", 1, testval=0.5)

print("\nparameter2.tag.test_value =", parameter2.tag.test_value)
parameter2.tag.test_value = 0.49999999904767284

Inference: Calculating Posteriors

Examples & Applications

  • Coin Toss
  • normal distribution
  • Linear Model (LM)
  • Generalized Linear Model (GLM)
  • Hierarchical Model

Coin Toss example with PyMC3

define model:

In [55]:
import pymc3 as pm

## hide warning from Theano
from theano import config
config.warn.round = False

## set configuration
n = 100 # number of coin tosses
h = 66 # absolute number of heads
# parameters of the Beta prior (alpha = beta = 2 puts the mode at 0.5)
alpha = 2 
beta = 2

niter = 1000 # number of iterations

## define model
with pm.Model() as model: 
    # define priors
    p = pm.Beta('p', alpha=alpha, beta=beta)

    # define likelihood
    y = pm.Binomial('y', n=n, p=p, observed=h)

print("• Model initialized")
• Model initialized
In [57]:
print("• run inference")

with model:
    start = pm.find_MAP() # Use MAP estimate (optimization) as the initial state for MCMC
    step = pm.Metropolis() # Have a choice of samplers
    trace = pm.sample(niter, step, start, random_seed=123, progressbar=True)

## plot results
plt.hist(trace['p'], 15, histtype='step', normed=True, label='post');
x = np.linspace(0, 1, 100)
plt.plot(x, stats.beta.pdf(x, alpha, beta), label='prior');
plt.legend(loc='best');
• run inference
Optimization terminated successfully.
         Current function value: 3.665377
         Iterations: 5
         Function evaluations: 6
         Gradient evaluations: 6
100%|██████████| 1000/1000 [00:00<00:00, 3533.59it/s]

linear regression – the Bayesian way

Davidson-Pilon, C. Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference. (Addison-Wesley Professional, 2015).

Salvatier J, Wiecki TV, Fonnesbeck C. (2016) Probabilistic programming in Python using PyMC3. PeerJ Computer Science 2:e55 https://doi.org/10.7717/peerj-cs.55

To introduce model definition, fitting and posterior analysis, we first consider a simple Bayesian linear regression model with normal priors for the parameters.

  • $Y$: outcome, normally distributed with
    • $\mu$: expected value
    • $\sigma$: observation error (standard deviation)

$$ Y \sim \mathcal{N}(\mu, \sigma^2) $$

  • $ \mu $: is a linear function of two predictor variables $X_1$ and $X_2$
  • $ \alpha $: intercept
  • $ \beta_1 $, $ \beta_2 $: coefficients for covariates $X_1$ and $X_2$ $$ \mu = \alpha + \beta_1 X_1 + \beta_2 X_2 $$

We choose zero-mean normal priors with variance of 100 for both regression coefficients, which corresponds to weak information regarding the true parameter values. We choose a half-normal distribution (normal distribution bounded at zero) as the prior for $\sigma$.

$$ \alpha \sim \mathcal{N}(0, 100) $$ $$ \beta_i \sim \mathcal{N}(0, 100) $$ $$ \sigma \sim \lvert\mathcal{N}(0, 1){\rvert} $$

Generating random data

In [63]:
import numpy as np
import matplotlib.pyplot as plt

# Initialize random number generator
np.random.seed(123)

# True parameter values
α, σ = 1, 1
β = [1, 2.5]

# Size of dataset
size = 100

# Predictor variable
X1 = np.random.randn(size)
X2 = np.random.randn(size) * 0.2

# Simulate outcome variable
Y = α + β[0]*X1 + β[1]*X2 + np.random.randn(size)*σ

print("• random data simulated")
• random data simulated
In [65]:
print("• plot information")

fig, axes = plt.subplots(1, 2, sharex=True, figsize=(10,4))
axes[0].scatter(X1, Y)
axes[1].scatter(X2, Y)
axes[0].set_ylabel('Y'); axes[0].set_xlabel('X1'); axes[1].set_xlabel('X2');
• plot information

Model Specification

In [72]:
linear_model = pm.Model()

with linear_model:

    alpha = pm.Normal('alpha', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=10, shape=2)
    sigma = pm.HalfNormal('sigma', sd=1)

    # Expected value of outcome
    mu = alpha + beta[0]*X1 + beta[1]*X2

    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=Y)

print("• linear model defined")
• linear model defined

Model fitting: maximum a posteriori methods (MAP)

  • mode of the posterior distribution
  • generally found using numerical optimization methods
  • often fast and easy
  • only finds a local optimum
  • only gives a point estimate
  • no estimate of uncertainty (therefore not "really Bayesian")
  • biased if
    • the mode is not representative of the distribution
    • e.g. in high dimensions, a region of high density can still carry low total probability
In [76]:
map_estimate = pm.find_MAP(model=linear_model)

print(map_estimate)
Optimization terminated successfully.
         Current function value: 149.017982
         Iterations: 16
         Function evaluations: 21
         Gradient evaluations: 21
{'alpha': array(0.9065985497559482), 'beta': array([ 0.94848602,  2.60705514]), 'sigma_log_': array(-0.032781470174030686)}
In [84]:
print("alpha:", map_estimate['alpha'])
print("beta:", map_estimate['beta'])
alpha: 0.9090521898977764
beta: [ 0.95140146  2.61437458]

applying an alternative optimizer:

In [85]:
from scipy import optimize

map_estimate = pm.find_MAP(model=linear_model, fmin=optimize.fmin_powell)

print("\nalpha:", map_estimate['alpha'])
print("beta:", map_estimate['beta'])
Optimization terminated successfully.
         Current function value: 149.019762
         Iterations: 4
         Function evaluations: 176

alpha: 0.9090521898977764
beta: [ 0.95140146  2.61437458]

Model fitting: sampling

A simulation-based approach such as Markov chain Monte Carlo (MCMC) can be used to obtain a Markov chain of values that, given the satisfaction of certain conditions, are indistinguishable from samples from the posterior distribution.

  • No-U-Turn Sampler (NUTS)
  • inference algorithm called "auto-diff variational inference" (ADVI)
In [87]:
with linear_model:
    # obtain starting values via MAP
    start = pm.find_MAP(fmin = optimize.fmin_powell)
    
    # instantiate sample
    step = pm.NUTS(scaling = start)
    
    # draw 2000 posterior samples
    trace = pm.sample(2000, step, start = start)
Optimization terminated successfully.
         Current function value: 149.019762
         Iterations: 4
         Function evaluations: 176
100%|██████████| 2000/2000 [00:06<00:00, 328.55it/s]

Posterior analysis

  • plotting and summarization for inspecting the sampling output
In [117]:
tmp = pm.traceplot(trace)
In [91]:
pm.summary(trace)
alpha:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  0.905            0.097            0.002            [0.724, 1.110]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  0.708          0.843          0.905          0.969          1.103


beta:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  0.949            0.089            0.002            [0.778, 1.129]
  2.601            0.499            0.013            [1.653, 3.585]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  0.771          0.893          0.949          1.007          1.124
  1.644          2.246          2.605          2.945          3.577


sigma:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  0.989            0.068            0.002            [0.858, 1.116]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  0.873          0.941          0.984          1.034          1.140

Generalized Linear Models with PyMC3

Salvatier J, Wiecki TV, Fonnesbeck C. (2016) Probabilistic programming in Python using PyMC3. PeerJ Computer Science 2:e55 https://doi.org/10.7717/peerj-cs.55

Generalized Linear Models (GLMs) are a class of flexible models that are widely used to estimate regression relationships between a single outcome variable and one or multiple predictors. Because these models are so common, PyMC3 offers a glm submodule that allows flexible creation of various GLMs with an intuitive R-like syntax.

1) data from "linear model"

In [93]:
# Convert X and Y to a pandas DataFrame
import pandas

df = pandas.DataFrame({'x1': X1, 'x2': X2, 'y': Y})

model specification

from pymc3.glm import glm

with pm.Model() as glm_model:
    glm('y ~ x1 + x2', df)
    trace = pm.sample(5000)

The error distribution, if not specified via the family argument, is assumed to be normal. In the case of logistic regression, this can be modified by passing in a Binomial family object.

from pymc3.glm.families import Binomial

# convert to binary outcome
df_logistic = pandas.DataFrame({'x1': X1, 'y': Y > np.median(Y)})

# define logistic regression model
with pm.Model() as model_glm_logistic:
    glm('y ~ x1', df_logistic, family=Binomial())

2) new data

In [100]:
size = 200
true_intercept = 1
true_slope = 2

x = np.linspace(0, 1, size)
# y = a + b*x
true_regression_line = true_intercept + true_slope * x
# add noise
y = true_regression_line + np.random.normal(scale=.5, size=size)

data = dict(x=x, y=y)

print("• random data generated")
• random data generated
In [102]:
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, xlabel='x', ylabel='y', title='Generated data and underlying model')
ax.plot(x, y, 'x', label='sampled data')
ax.plot(x, true_regression_line, label='true regression line', lw=2.)
plt.legend(loc=0);

Model definition: manual

In [105]:
with pm.Model() as model_glm: 
    
    # Define priors
    sigma = pm.HalfCauchy('sigma', beta=10, testval=1.)
    intercept = pm.Normal('Intercept', 0, sd=20)
    x_coeff = pm.Normal('x', 0, sd=20)

    # Define likelihood
    likelihood = pm.Normal('y', mu=intercept + x_coeff * x,
                        sd=sigma, observed=y)

    # Inference
    start = pm.find_MAP() # Find starting value by optimization
    step = pm.NUTS(scaling=start) # Instantiate MCMC sampling algorithm
    trace = pm.sample(2000, step, start=start, progressbar=False) # draw 2000 posterior samples using NUTS sampling
Optimization terminated successfully.
         Current function value: 161.171367
         Iterations: 15
         Function evaluations: 20
         Gradient evaluations: 20

Build model: formula interface

  • shortcut provided by PyMC3
  • shorter code
  • exactly the same procedure as above
  • still customizable
In [107]:
with pm.Model() as model_glm2:
    # specify glm and pass in data. The resulting linear model, its likelihood
    # and all its parameters are automatically added to our model.
    pm.glm.glm('y ~ x', data)
    
    # Inference
    start = pm.find_MAP()
    step = pm.NUTS(scaling=start) # Instantiate MCMC sampling algorithm
    trace = pm.sample(2000, step, progressbar=False) # draw 2000 posterior samples using NUTS sampling
Optimization terminated successfully.
         Current function value: 182.804440
         Iterations: 16
         Function evaluations: 21
         Gradient evaluations: 21
In [121]:
plt.figure(figsize=(7, 7))
pm.traceplot(trace[100:])
plt.tight_layout();
In [110]:
plt.figure(figsize=(7, 7))
plt.plot(x, y, 'x', label='data')
pm.glm.plot_posterior_predictive(trace, samples=100,
                              label='posterior predictive regression lines')
plt.plot(x, true_regression_line, label='true regression line', lw=3., c='y')

plt.title('Posterior predictive regression lines')
plt.legend(loc=0)
plt.xlabel('x')
plt.ylabel('y');

Multilevel Model with PyMC3

Gelman et al.'s (2007) radon dataset is a classic for hierarchical modeling. In this dataset the amount of the radioactive gas radon has been measured in households across the counties of several states. Radon gas is known to be the leading cause of lung cancer in non-smokers. It is believed to be more strongly present in households with a basement and to differ in concentration between soil types. Here we'll investigate these differences and try to predict radon levels in different counties based on the county itself and the presence of a basement. In this example we look at Minnesota, a state with 85 counties, with between 2 and 80 measurements per county.

Example from Chris Fonnesbeck
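A minimal sketch of a varying-intercept (partial pooling) model for such data; the tiny arrays and variable names (county_idx, floor, log_radon) are illustrative stand-ins for the prepared radon data, not the original notebook code:

import numpy as np
import pymc3 as pm

# hypothetical preprocessed data
county_idx = np.array([0, 0, 1, 1, 2, 2])              # integer county code per measurement
floor = np.array([0, 1, 0, 1, 0, 1])                   # 0 = basement, 1 = first floor
log_radon = np.array([1.1, 0.8, 1.4, 0.9, 0.7, 0.5])   # log of the measured radon level
n_counties = len(np.unique(county_idx))

with pm.Model() as hierarchical_model:
    # hyperpriors for the county intercepts
    mu_a = pm.Normal('mu_a', mu=0., sd=10)
    sigma_a = pm.HalfCauchy('sigma_a', beta=1)

    # one intercept per county, partially pooled towards mu_a
    a = pm.Normal('a', mu=mu_a, sd=sigma_a, shape=n_counties)
    # common slope for the basement / floor effect
    b = pm.Normal('b', mu=0., sd=10)

    # observation noise and expected value per measurement
    eps = pm.HalfCauchy('eps', beta=1)
    radon_est = a[county_idx] + b * floor

    radon_like = pm.Normal('radon_like', mu=radon_est, sd=eps, observed=log_radon)

    trace = pm.sample(1000)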