[tricks0]    Tricks: Advanced Use of the
      BUGS Language


Contents

    Specifying a new sampling distribution
    Specifying a new prior distribution
    Specifying a discrete prior on a set of values
    Using pD and DIC
    Mixtures of models of different complexity
    Where the size of a set is a random quantity
    Assessing sensitivity to prior assumptions
    Modelling unknown denominators
    Handling unbalanced datasets
    Use of the "cut" function

Specifying a new sampling distribution [top]
Suppose we wish to use a sampling distribution that is not included in the list of standard distributions , in which an observation x[i] contributes a likelihood term L[i] . We may use the "zeros trick": a Poisson(phi) observation of zero has likelihood exp(-phi), so if our observed data is a set of 0's, and phi[i] is set to - log(L[i]), we will obtain the correct likelihood contribution. (Note that phi[i] should always be > 0 as it is a Poisson mean, and so we may need to add a suitable constant to ensure that it is positive.) This trick is illustrated by an example new-sampling in which a normal likelihood is constructed (using the zeros trick) and compared to the standard analysis.

   C <- 10000    # this just has to be large enough to ensure all phi[i]'s > 0
   for (i in 1:N) {
      zeros[i] <- 0
      phi[i] <- -log(L[i]) + C
      zeros[i] ~ dpois(phi[i])
   }

This trick allows arbitrary sampling distributions to be used, and is particularly suitable when, say, dealing with truncated distributions.

A new observation x.pred can be predicted by specifying it as missing in the data-file and assigning it a uniform prior, e.g.

   x.pred ~ dflat()    # improper uniform prior on new x

However our example shows that this method can be very inefficient and give a very high MC error.

An alternative to using 'zeros' is the "ones trick", where the data is a set of 1's assumed to be the results of Bernoulli trials with probabilities p[i]. By making each p[i] proportional to L[i] (i.e. by specifying a scaling constant large enough to ensure all p[i]'s are < 1) the required likelihood term is provided.

   C <- 10000    # this just has to be large enough to ensure all p[i]'s < 1
   for (i in 1:N) {
      ones[i] <- 1
      p[i] <- L[i] / C
      ones[i] ~ dbern(p[i])
   }

Specifying a new prior distribution [top]
If, for a parameter theta , say, we want to use a prior distribution that is not part of the standard set , then we can use the 'zeros' trick (see above) at the prior level. A single zero Poisson observation (with mean phi = phi( theta )) contributes a term exp(-phi) to the likelihood for theta ; when this is combined with a 'flat' prior for theta the correct prior distribution results.

         zero <- 0
         theta ~ dflat()
         phi <-
expression for - log( desired prior for theta )
         zero ~ dpois(phi)

This is illustrated in
new-prior by an example in which a normal prior is constructed using the zeros trick and the results are compared to the standard formulation. It is important to note that this method produces high auto-correlation, poor convergence and high MC error, so it is computationally slow and long runs are necessary.

Specifying a discrete prior on a set of values [top]
Suppose you want a parameter D to take one of a set of values, d[1], ..., d[K], say, with probabilities p[1], ..., p[K]. Then specify the arrays d[1:K] and p[1:K] in a data-file (or in the model) and use:

         M ~ dcat(p[])         # choose which element of d[] to use.
         D <- d[M]

This is illustrated by an example
t-df of learning about the degrees of freedom of a t-distribution.

If the discrete prior is put on too coarse a grid, then there may be numerical problems (crashes), unless good initial values are provided, and very poor mixing. It is therefore advised to either use a continuous prior or a discrete prior on a fine grid - see the
t-df example.

Using pD and DIC [top]
Here we make a number of observations regarding the use of DIC and pD - for a full discussion see Spiegelhalter  et al . (2002) :

1) DIC is intended as a generalisation of Akaike's Information Criterion (AIC). For non-hierarchical models, pD should be approximately the true number of parameters.

2) Slightly different values of Dhat (and hence pD and DIC) can be obtained depending on the parameterisation used for the prior distribution. For example, consider the precision
tau (1 / variance) of a normal distribution. The two priors

      tau ~ dgamma(0.001, 0.001)
and
      log.tau ~ dunif(-10, 10); log(tau) <- log.tau

are essentially identical but will give slightly different results for Dhat: for the first prior the stochastic parent is tau and hence the posterior mean of tau is substituted in Dhat, while in the second parameterisation the stochastic parent is log.tau and hence the posterior mean of log( tau ) is substituted in Dhat.

3) For sampling distributions that are log-concave in their stochastic parents, pD is guaranteed to be positive (provided the simulation has converged). However, it is theoretically possible to get negative values. We have obtained negative pD's in the following situations:


i) with non-log-concave likelihoods (e.g. Student-t distributions) when there is substantial conflict between prior and data;
ii) when the posterior distribution for a parameter is symmetric and bimodal, and so the posterior mean is a very poor summary statistic and gives a very large deviance.


4) No MC error is available on the DIC. MC error on Dbar can be obtained by monitoring deviance and is generally quite small. The primary concern is to ensure convergence of Dbar - it is therefore worthwhile checking the stability of Dbar over a long chain.

5) The minimum DIC estimates the model that will make the best short-term predictions, in the same spirit as Akaike's criterion. However, if the difference in DIC is, say, less than 5, and the models make very different inferences, then it could be misleading just to report the model with the lowest DIC.

6) DICs are comparable only over models with exactly the same observed data, but there is no need for them to be nested.

7) DIC differs from Bayes factors and BIC in both form and aims.

8) Caution is advisable in the use of DIC until more experience has been gained.
It is important to note that the calculation of DIC will be disallowed for certain models. Please see the WinBUGS 1.4 web-page for details:

http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml

Mixtures of models of different complexity [top]
Suppose we assume that each observation, or group of observations, is from one of a set of distributions, where the members of the set have different complexity. For example, we may think data for each person's growth curve comes from either a linear or quadratic line. We might think we would require 'reversible jump' techniques, but this is not the case as we are really only considering a single mixture model as a sampling distribution. Thus standard methods for setting up mixture distributions can be adopted, but with components having different numbers of parameters.

The
mixtures example illustrates how this is handled in WinBUGS, using a set of simulated data.

Naturally, the standard warnings about mixture distributions apply, in that convergence may be poor and careful parameterisation may be necessary to avoid some of the components becoming empty.

Where the size of a set is a random quantity [top]
Suppose the size of a set is a random quantity: this naturally occurs in 'changepoint' problems where observations up to an unknown changepoint K come from one model, and after K come from another. Note that we cannot use the construction

         for (i in 1:K) {
               y[i] ~
model 1
         }
         for (i in (K + 1):N) {
               y[i] ~
model 2
         }

since the index for a loop cannot be a random quantity. Instead we can use the step function to set up an indicator as to which set each observation belongs to:

         for (i in 1:N) {
               ind[i] <- 1 + step(i - K - 0.01)
      # will be 1 for all i <= K, 2 otherwise
               y[i] ~
model ind[i]
         }

This is illustrated in
random-sets by the problem of adding up terms in a series of unknown length, and in Stagnant by a changepoint problem.

Assessing sensitivity to prior assumptions [top]
One way to do this is to repeat the analysis under different prior assumptions, but within the same simulation in order to aid direct comparison of results. Assuming the consequences of K prior distributions are to be compared:

a) replicate the dataset K times within the model code;
b) set up a loop to repeat the analysis for each prior, holding results in arrays;
c) compare results using the
'compare' facility.

The example prior-sensitivity explores six different suggestions for priors on the random-effects variance in a meta-analysis.

Modelling unknown denominators [top]
Suppose we have an unknown Binomial denominator for which we wish to express a prior distribution. It can be given a Poisson prior but this makes it difficult to express a reasonably uniform distribution. Alternatively a continuous distribution could be specified and then the 'round' function used. For example, suppose we are told that a fair coin has come up heads 10 times - how many times has it been tossed?

         model {
               r <- 10
               p <- 0.5
               r ~ dbin(p, n)
               n.cont ~ dunif(1, 100)
               n <- round(n.cont)
         }

   node   mean   sd   MC error   2.5%   median   97.5%   start   sample
   n   21.08   4.794   0.07906   13.0   21.0   32.0   1001   5000
   n.cont   21.08   4.804   0.07932   13.31   20.6   32.0   1001   5000
   
Assuming a uniform prior for the number of tosses, we can be 95% sure that the coin has been tossed between 13 and 32 times. A discrete prior on the integers could also have been used in this context.

Handling unbalanced datasets [top]
Suppose we observe the following data on three individuals:

      Person 1: 13.2
      Person 2: 12.3 , 14.1
      Person 3: 11.0, 9.7, 10.3, 9.6
   
There are three different ways of entering such 'ragged' data into WinBUGS:

1. Fill-to-rectangular: Here the data is 'padded out' by explicitly including the missing data, i.e.

   y[,1]   y[,2]   y[,3]   y[,4]
   13.2   NA   NA   NA
   12.3   14.1   NA   NA
   11.0   9.7   10.3   9.6
END

or list(y = structure(.Data = c(13.2, NA, NA, NA, 12.3, 14.1, NA, NA, 11.0, 9.7, 10.3, 9.6), .Dim = c(3, 4)).

A model such as
y[i, j] ~ dnorm(mu[i], 1) can then be fitted. This approach is inefficient unless one explicitly wishes to estimate the missing data.

2. Nested indexing: Here the data are stored in a single array and the associated person is recorded as a factor, i.e.

   y[]   person[]
   13.2   1
   12.3   2
   14.1   2
   11.0   3
   9.7   3
   10.3   3
   9.6   3
END

or list(y = c(13.2, 12.3, 14.1, 11.0, 9.7, 10.3, 9.6), person = c(1, 2, 2, 3, 3, 3, 3)).

A model such as
y[k] ~ dnorm(mu[person[k]], 1) can then be fitted. This seems an efficient and clear way to handle the problem.

3. Offset: Here an 'offset' array holds the position in the data array at which each person's data starts. For example, the data might be

list(y = c(13.2, 12.3, 14.1, 11.0, 9.7, 10.3, 9.6), offset = c(1, 2, 4, 8))

and lead to a model containing the code

   for (k in offset[i]:(offset[i + 1] - 1)) {
      y[k] ~ dnorm(mu[i], 1)
   }

The danger with this method is that it relies on getting the offsets correct and they are difficult to check.
See the ragged example for a full worked example of these methods.


Use of the "cut" function [top]
Suppose we observe some data that we do not wish to contribute to the parameter estimation and yet we wish to consider as part of the model. This might happen, for example:

a) when we wish to make predictions on some individuals on whom we have observed some partial data that we do not wish to use for parameter estimation;
b) when we want to use data to learn about some parameters and not others;
c) when we want evidence from one part of a model to form a prior distribution for a second part of the model, but we do not want 'feedback' from this second part.

The "cut" function forms a kind of 'valve' in the graph: prior information is allowed to flow 'downwards' through the cut, but likelihood information is prevented from flowing upwards.

For example, the following code leaves the distribution for theta unchanged by the observation y.

   model
   {
      y <- 2
      y ~ dnorm(theta.cut, 1)
      theta.cut <- cut(theta)
      theta ~ dnorm(0, 1)
   }