** Tricks: Advanced Use of the BUGS Language
**

Suppose we wish to use a sampling distribution that is not included in the

C <- 10000 # this just has to be large enough to ensure all phi[i]'s > 0

for (i in 1:N) {

zeros[i] <- 0

phi[i] <- -log(L[i]) + C

zeros[i] ~ dpois(phi[i])

}

This trick allows arbitrary sampling distributions to be used, and is particularly suitable when, say, dealing with truncated distributions.

A new observation x.pred can be predicted by specifying it as missing in the data-file and assigning it a uniform prior, e.g.

x.pred ~ dflat() # improper uniform prior on new x

However our example shows that this method can be very inefficient and give a very high MC error.

An alternative to using 'zeros' is the "ones trick", where the data is a set of 1's assumed to be the results of Bernoulli trials with probabilities p[i]. By making each p[i] proportional to L[i] (i.e. by specifying a scaling constant large enough to ensure all p[i]'s are < 1) the required likelihood term is provided.

C <- 10000 # this just has to be large enough to ensure all p[i]'s < 1

for (i in 1:N) {

ones[i] <- 1

p[i] <- L[i] / C

ones[i] ~ dbern(p[i])

}

If, for a parameter

zero <- 0

theta ~ dflat()

phi <-

zero ~ dpois(phi)

This is illustrated in

Suppose you want a parameter D to take one of a set of values, d[1], ..., d[K], say, with probabilities p[1], ..., p[K]. Then specify the arrays d[1:K] and p[1:K] in a data-file (or in the model) and use:

M ~ dcat(p[]) # choose which element of d[] to use.

D <- d[M]

This is illustrated by an example

If the discrete prior is put on too coarse a grid, then there may be numerical problems (crashes), unless good initial values are provided, and very poor mixing. It is therefore advised to either use a continuous prior or a discrete prior on a fine grid - see the

Here we make a number of observations regarding the use of DIC and pD - for a full discussion see

1) DIC is intended as a generalisation of Akaike's Information Criterion (AIC). For non-hierarchical models, pD should be approximately the true number of parameters.

2) Slightly different values of Dhat (and hence pD and DIC) can be obtained depending on the parameterisation used for the prior distribution. For example, consider the precision

tau ~ dgamma(0.001, 0.001) and

log.tau ~ dunif(-10, 10); log(tau) <- log.tau

are essentially identical but will give slightly different results for Dhat: for the first prior the stochastic parent is

3) For sampling distributions that are log-concave in their stochastic parents, pD is guaranteed to be positive (provided the simulation has converged). However, it is theoretically possible to get negative values. We have obtained negative pD's in the following situations:

i) with non-log-concave likelihoods (e.g. Student-t distributions) when there is substantial conflict between prior and data;

ii) when the posterior distribution for a parameter is symmetric and bimodal, and so the posterior mean is a very poor summary statistic and gives a very large deviance.

4) No MC error is available on the DIC. MC error on Dbar can be obtained by monitoring deviance and is generally quite small. The primary concern is to ensure convergence of Dbar - it is therefore worthwhile checking the stability of Dbar over a long chain.

5) The minimum DIC estimates the model that will make the best short-term predictions, in the same spirit as Akaike's criterion. However, if the difference in DIC is, say, less than 5, and the models make very different inferences, then it could be misleading just to report the model with the lowest DIC.

6) DICs are comparable only over models with exactly the same observed data, but there is no need for them to be nested.

7) DIC differs from Bayes factors and BIC in both form and aims.

8) Caution is advisable in the use of DIC until more experience has been gained.

Suppose we assume that each observation, or group of observations, is from one of a set of distributions, where the members of the set have different complexity. For example, we may think data for each person's growth curve comes from either a linear or quadratic line. We might think we would require 'reversible jump' techniques, but this is not the case as we are really only considering a single mixture model as a sampling distribution. Thus standard methods for setting up mixture distributions can be adopted, but with components having different numbers of parameters.

The

Naturally, the standard warnings about mixture distributions apply, in that convergence may be poor and careful parameterisation may be necessary to avoid some of the components becoming empty.

Suppose the size of a set is a random quantity: this naturally occurs in 'changepoint' problems where observations up to an unknown changepoint K come from one model, and after K come from another. Note that we

for (i in 1:K) {

y[i] ~

}

for (i in (K + 1):N) {

y[i] ~

}

since the index for a loop cannot be a random quantity. Instead we can use the step function to set up an indicator as to which set each observation belongs to:

for (i in 1:N) {

ind[i] <- 1 + step(i - K - 0.01) # will be 1 for all i <= K, 2 otherwise

y[i] ~

}

This is illustrated in

One way to do this is to repeat the analysis under different prior assumptions, but within the same simulation in order to aid direct comparison of results. Assuming the consequences of K prior distributions are to be compared:

a) replicate the dataset K times within the model code;

b) set up a loop to repeat the analysis for each prior, holding results in arrays;

c) compare results using the

The example

Suppose we have an unknown Binomial denominator for which we wish to express a prior distribution. It can be given a Poisson prior but this makes it difficult to express a reasonably uniform distribution. Alternatively a continuous distribution could be specified and then the 'round' function used. For example, suppose we are told that a fair coin has come up heads 10 times - how many times has it been tossed?

model {

r <- 10

p <- 0.5

r ~ dbin(p, n)

n.cont ~ dunif(1, 100)

n <- round(n.cont)

}

n.cont 21.08 4.804 0.07932 13.31 20.6 32.0 1001 5000

Assuming a uniform prior for the number of tosses, we can be 95% sure that the coin has been tossed between 13 and 32 times. A discrete prior on the integers could also have been used in this context.

Suppose we observe the following data on three individuals:

Person 1: 13.2

Person 2: 12.3 , 14.1

Person 3: 11.0, 9.7, 10.3, 9.6

There are three different ways of entering such 'ragged' data into WinBUGS:

y[,1] y[,2] y[,3] y[,4]

13.2 NA NA NA

12.3 14.1 NA NA

11.0 9.7 10.3 9.6

END

or list(y = structure(.Data = c(13.2, NA, NA, NA, 12.3, 14.1, NA, NA, 11.0, 9.7, 10.3, 9.6), .Dim = c(3, 4)).

A model such as y[i, j] ~ dnorm(mu[i], 1) can then be fitted. This approach is inefficient unless one explicitly wishes to estimate the missing data.

y[] person[]

13.2 1

12.3 2

14.1 2

11.0 3

9.7 3

10.3 3

9.6 3

END

or list(y = c(13.2, 12.3, 14.1, 11.0, 9.7, 10.3, 9.6), person = c(1, 2, 2, 3, 3, 3, 3)).

A model such as y[k] ~ dnorm(mu[person[k]], 1) can then be fitted. This seems an efficient and clear way to handle the problem.

list(y = c(13.2, 12.3, 14.1, 11.0, 9.7, 10.3, 9.6), offset = c(1, 2, 4, 8))

and lead to a model containing the code

for (k in offset[i]:(offset[i + 1] - 1)) {

y[k] ~ dnorm(mu[i], 1)

}

The danger with this method is that it relies on getting the offsets correct and they are difficult to check. See the

Suppose we observe some data that we do not wish to contribute to the parameter estimation and yet we wish to consider as part of the model. This might happen, for example:

a) when we wish to make predictions on some individuals on whom we have observed some partial data that we do not wish to use for parameter estimation;

b) when we want to use data to learn about some parameters and not others;

c) when we want evidence from one part of a model to form a prior distribution for a second part of the model, but we do not want 'feedback' from this second part.

The "cut" function forms a kind of 'valve' in the graph: prior information is allowed to flow 'downwards' through the cut, but likelihood information is prevented from flowing upwards.

For example, the following code leaves the distribution for theta unchanged by the observation y.

model

{

y <- 2

y ~ dnorm(theta.cut, 1)

theta.cut <- cut(theta)

theta ~ dnorm(0, 1)

}