Abstract
Background
Most Bayesian models for the analysis of complex traits are not analytically tractable and inferences are based on computationally intensive techniques. This is true of Bayesian models for genomeenabled selection, which uses wholegenome molecular data to predict the genetic merit of candidate animals for breeding purposes. In this regard, parallel computing can overcome the bottlenecks that can arise from series computing. Hence, a major goal of the present study is to bridge the gap to highperformance Bayesian computation in the context of animal breeding and genetics.
Results
Parallel Monte Carlo Markov chain algorithms and strategies are described in the context of animal breeding and genetics. Parallel Monte Carlo algorithms are introduced as a starting point including their applications to computing singleparameter and certain multipleparameter models. Then, two basic approaches for parallel Markov chain Monte Carlo are described: one aims at parallelization within a single chain; the other is based on running multiple chains, yet some variants are discussed as well. Features and strategies of the parallel Markov chain Monte Carlo are illustrated using real data, including a large beef cattle dataset with 50K SNP genotypes.
Conclusions
Parallel Markov chain Monte Carlo algorithms are useful for computing complex Bayesian models, which does not only lead to a dramatic speedup in computing but can also be used to optimize model parameters in complex Bayesian models. Hence, we anticipate that use of parallel Markov chain Monte Carlo will have a profound impact on revolutionizing the computational tools for genomic selection programs.
Background
In recent decades, Bayesian inference has been increasingly used for analysis of complex statistical models, in part because of increased availability and performance of personal computers and workstations. However, such models are generally not analytically tractable and, hence, computationally demanding numerical techniques are inevitably required. This is especially true of Bayesian computation for genomeenabled prediction and selection, which aims at using wholegenome molecular data to predict the genetic merit of candidate animals for breeding purposes [1]. Typically, implementation of a highdimensional model based on Markov chain Monte Carlo (MCMC) techniques is notoriously intensive in computing and often requires days, weeks, or even months of CPU (Central Processing Unit) time on personal computers and workstations [2]. Therefore, in order to overcome such computational burden, parallel computing becomes appealing [3,4].
Parallel computing operates on the principle that a large problem can be split into smaller components and solved concurrently (i.e., ”in parallel”), each on a separate processor (or CPU core) [5]. An instance of a computer program and its activities that are taking place on each processor is referred to a process. Thus, parallel computing involves activating multiple processes that concurrently carry out related computing jobs and combining results by the main “controlling” process. Parallel computing can be achieved by programming with C, C++, or Fortran, e.g., using the MPI (Message Passing Interface) library to handle interprocess communication [6]. Highperformance computing communities have developed parallel programs for decades but were previously limited to programs running on expensive supercomputers. In the past twenty years, interest in parallel computing has grown markedly due to physical constraints that prevent frequency scaling [5] and to the need to handle datasets of unprecedented dimensionalities that are being generated [7]. Parallel computing has now become a dominant paradigm in current computer architectures, mainly in the form of multicore processors [8].
Parallel MCMC methods have recently been adopted in statistics and informatics [4,9] and in image processing [10] but they have not received much attention in animal breeding and genetics. There are several reasons for this gap. First, MCMC algorithms are seemingly serial, and parallelism is not as straightforward as one would expect. Second, many intensive computational tasks in breeding and genetics applications have been handled via some simple data parallelism, implemented through the “multipletasking” mechanism provided by multicore Linux workstations. Multipletasking allows each processor to switch between tasks being executed on it, without having to wait for each task to finish, but this type of “parallel” computing is not scalable with the number of jobs. Recently, parallel MCMC algorithms and strategies have become a focal point for scientific computing in the postgenome era [4]. This is largely due to the need to handle genomic datasets of unprecedented sizes, such as genomewide dense markers or sequences for genomeenabled selection programs [2]. With a set of wholegenome markers (say 50K SNP markers or higher density) in a model, the computing task is highly challenging, particularly with sophisticated Bayesian models via MCMC implementation [1,1113].
In this paper, we present a technical description of parallel MCMC methods in the context of animal breeding and genetics. These algorithms typically fall into two categories: running multiple MCMC chains or parallelization within a single chain; some variants of these algorithms are discussed as well. A major purpose of this paper is to advocate the use of parallel MCMC methods, hence infusing highperformance computing technologies into animal selection programs in the postgenome era.
Methods
Parallel Monte Carlo methods
We start with parallel Monte Carlo methods, as a prelude to parallel MCMC. In practice, many statistical problems involve integrating over hundreds or even thousands of dimensions but usually these problems are not analytically tractable. Instead, Monte Carlo simulation methods [14] can be used to tackle highdimensional integrals. Standard Monte Carlo integration algorithms distribute the evaluation points uniformly over the integration regions.
Parallel computing for evaluating integrals
To begin, consider the following integral
for some highdimensional θ with density
Simple Monte Carlo algorithms proceed by averaging large numbers of values that are generated independently of each other. Obviously, Monte Carlo simulation is parallel in computing because it can be conducted concurrently. By parallel computing, the entire set of samples can be divided among the available CPU cores and then each core generates a portion and summarizes its local samples. After all processors have finished their tasks, a master program summarizes all the partial data and outputs the final result.
Suppose that there are K CPU cores that generate a total of T samples, each handling an equal portion of these samples. For simplicity, we assume that T is divisible by K, such that the quotient (m = T/K) is an integer. Then, parallel Monte Carlo simulation proceeds as follows [3]:
➔ Process 0 (master process):
(a) computes and passes m to each process.
➔ Each (slave) process (say j):
(a) simulates m independent realizations of
(b) computes
➔ Process 0 (master program):
(a) sums S_{j} and generates the final sum
(b) computes the Monte Carlo estimate as
Note that, in this example, the master process does not involve computing the sum of a portion of the data but it actually can. Also, note that each process is given the same number of samples. This works well if all CPU cores process the data at the same speed or approximately so. In practice, however, clock frequencies (i.e., computing speed) can vary markedly among processors. Hence, it can be more effective for each processor (or CPU core) to process a different number of samples, roughly proportional to its computing speed, and then let the master program compute the weighted average of all samples obtained from the K cores.
Parallel computing of singleparameter models
A singleparameter model can serve as a building block for Bayesian modeling [15]. Consider a normal distribution with known mean μ and unknown variance
where
it can be shown that the posterior density of
Hence, the posterior mean of
To show why the algorithm of parallel Markov chain simulation applies to parallel
computing of a singleparameter model, consider equation (1). For this singleparameter
normal model, for example, the marginal posterior expectation of
Clearly, (5) implies a similar Monte Carlo implementation: if n samples of
Parallel computing of multipleparameter models
Many models involve more than one unknown. Although many parameters are involved, conclusions are often drawn about one or only a few parameters at a time. In Bayesian analysis, the aim is to obtain the marginal posterior distribution of each parameter of interest. Often, we can construct the joint posterior distribution of all unknowns and then integrate this distribution over the unknowns that are not of immediate interest, leading to the desired marginal distribution of the parameter of interest.
Now, consider the normal distribution (2) but with both mean and variance unknown.
Assuming prior independence of location and scale parameters, a vague prior density
for μ and
Then, it can be shown that the marginal posterior distribution of
where
Therefore, posterior samples for
➔ Sampling
➔ Sampling
Parallel Markov Chain Monte Carlo
Analytical solutions are not always available for most multipleparameter models.
Instead, MCMC simulation can be used to draw samples from the joint posterior distribution
and then evaluate sampled values for the parameter(s) of interest while ignoring the
values of other unknowns. MCMC methods are a variant of Monte Carlo schemes in which
a Markov chain
Parallel MCMC by running multiple chains
A naive yet natural approach to parallel MCMC is simply to generate several independent Markov chains on different processors and then combine results appropriately [17,18]. Given that running multiple chains is simple and that they scale well with the number of available processors (or CPU cores), this type of “multiplechain” parallelism is usually the strategy to strive for in the first instance.
Assume that we want to estimate some target distribution
An appealing advantage of running multiple chains is that these processes can be conducted
concurrently with minimal coordination among tasks, as in the case of parallel Monte
Carlo simulation. However, unlike parallel Monte Carlo simulation, a major concern
with running multiple MCMC is that the overall reduction in runtime from parallelism
can be limited by the portion of each chain to be discarded in the beginning of MCMC
sampling for convergence purposes (i.e., burnin). If every chain has to spend a significant
proportion of its time in burnin, this would place serious limitations on the performance
of the algorithm, because it would not scale well with an increasing number of processors
[4]. According to Amdahl’s (1967) law [20], if some portion ρ of steps, for
where n is the number of iterations after burnin. Thus, parallel MCMC computing by virtue of running multiple chains is rewarding only when ρ is small. However, if ρ is large, the gain in computation through running multiple chains instead of a single long chain can be very disappointing.
Although running multiple Markov chains is theoretically straightforward, chains are not necessarily ergodic. Hence, some variant multiple MCMC methods have been proposed. For example, samples from multiple Markov chains may be confined to isolated modes if the target distribution is multimodal, or the chains may mix poorly when there are strong correlations between variables. Unfortunately, the latter is a common problem of Gibbs samplers [21]. Hence, pooling samples from multiple short chains may not necessarily give a better representation of p(X) than using a single long chain. If several chains are drifting to disparate modes, they will tend to be strongly influenced by the chains that they confine, because the weights will not necessarily be proportional to their relative masses.
Several strategies have been proposed for handling the aforementioned issues for single chains, such as adaptive MCMC algorithms [16,22] and tempering [23,24]. Metropoliscoupled MCMC is an algorithm that is related to simulated tempering and tempered transitions [23,24]. It proceeds by simultaneously running a number of different Markov chains that are governed by different (but related) Markov chain transition probabilities. Occasionally, the algorithm “swaps” values between two different chains, with probability governed by the Metropolis algorithm to preserve the stationarity of the target distribution. These swaps can speed up convergence of the algorithm substantially [4]. Craiu et al. [25] targeted the posterior with an ensemble of chains, using the covariance of samples across all chains to adapt the proposal covariance for a set of MetropolisHastings chains. While these multiplechain methods use synchronous exchange of samples to expedite convergence, Murray [26] proposed mixing in an additional independent proposal, representing some hitherto best estimate or summary of the posterior, and cooperatively adapting across chains. The idea is to construct a global best estimate of the posterior at any given step and then mix this estimate as a remote component with whatever local proposal that a chain has adopted. This does not preclude adaptive treatment or tempering of that local proposal. It also permits a heterogeneous blend of remote proposals, so that the ensemble of chains can mix well.
Parallelization within a single chain
By running multiple Markov chains, we often observe that samplers mix poorly and each chain may require a very long burnin time. Hence, it would be preferable to develop parallelism within a single chain, instead of running multiple chains. As mentioned in the previous section, Markov chain simulation is an iterative procedure, in the sense that simulation of the next value of the chain depends on the current value. This creates difficulty for delivering parallelism for a single Markov chain. Nevertheless, we will show that a single chain can be parallelized, subject to assumptions of conditional independence in the model. The key is to identify such steps that can be implemented in parallel.
Consider a Bayesian model with p scale parameters
➔ Master program:
(a) samples a new σ, given realization of θ and the data y, and
(b) distributes the new σ to each process.
➔ Each process (k):
(a) updates a subset of θs that have been assigned to it, conditional on σ and y,
(b) computes summary statistics for the updated
(c) passes the summary statistics back to the master program.
Often, the above algorithm works quite well when the θ are all independent of one another, given σ and y. In practice, however, such independence may not necessarily hold and strategies must be developed to deliver efficient parallel MCMC algorithms given specific dependence between elements [3].
Applications
Parallel simulation for a singleparameter normal model
Consider a normal distribution model with unknown μ and known
If a normal prior is assumed, that is,
where
Intuitively, the posterior mean of θ is expressed as a weighted average of the prior mean (
The example data are average body weight daily gains (ADG) measured on 7670 Angus
cattle. The kernel density of ADG is shown in Figure 1, which approximately suggests a normal distribution. Assume that we know, from previous
studies, that the population variance of ADG is 0.58. In this example, the prior distribution
is assumed to be normal with mean equal to 4.0 and variance equal to 1.0 (these are
just guesses of the parameter values in the distribution of ADG). A parallel C program
was used in this analysis (Appendix). To compile the parallel program, say using MPICH2,
type: mpicc singNormMod_Parallel.c –o singNormP –lm [enter]. To conduct computing
in parallel, type: mpirun –np xx ./singNormP [enter], where xx is the number of processors
involved (or CPU cores). To estimate
Figure 1. Kernel density of average body weight daily gain measured in 7668 Angus cattle .
The posterior mean was estimated to be 3.394, which corresponded very well to the sample mean of ADG of the 7670 Angus cattle (Table 1) because the impact of the prior on the posterior could be ignored given the very large sample size. The median and mean agreed well with each other and the first and third quartiles were also very similar (Table 1). These are indications that the posterior distribution of the mean of ADG is symmetric.
Table 1. Posterior summary statistics of average body weight daily gain based on a singleparameter normal model
The purpose of this example was to show parallel computing using the MPI (Message Passing Interface) library. The change in computing time for this example was, however, almost insignificant because sampling from a normal distribution is very quick. In addition, with parallel simulation, interprocess communication requires some extra time as overhead, which offset gains from parallel computing.
MPI is a languageindependent communication protocol used to program parallel computers that is extensively used for highperformance computing. More specifically, MPI is a library of routines for creating parallel programs e.g., in C or Fortran 77, that allow users to create programs that can run on most parallel computer architectures. (Note that there is a language extension to Fortran90 called High Performance Fortran – HPF, which supports highperformance computing.) In the example code, the MPI library was used to handle interprocess communications in the C program. With MPI, each task can have its own local memory during computation (but multiple tasks can reside in the same physical machine and/or an arbitrary number of machines). Typically, tasks exchange data by sending and receiving messages but data transfer usually requires cooperation among processors, that is, a “send” operation must have a matching “receive” operation.
A few details about this program in the Appendix are described in the following. MPI_Comm_rand() is used to find out the ID of all participating processors and MPI_Comm_size() is used to get the number of participating processors. A common pattern of interaction among parallel processors is to use MPI_Send() and MPI_Receive() to allocate work among them. In the present example, however, this was done in a slightly different manner. MPI_Bcast() is used to send common parameter values (e.g., number of simulation steps) to all participating processors. Then, after each processor has finished its work, the subroutine MPI_Reduction() is used to sum up the posterior values from all processors. Subroutine MPI_Reduction() collects data from all processors, reduces the data to a single value (e.g., by summation), and then stores the results on the master process (and on all processes as well). There are several predefined operations that MPI_Reduction() can provide. In addition to summation, it can also conduct multiplication, and find minimum or maximum values. Finally, the master processor computes the means and standard deviation (and other posterior statistics, when relevant) for the mean of the normal model. Note that, in this illustration, we used sequential functions to generate random numbers (http://apps.nrbook.com/c/index.html webcite), with process ID used as the random number seed. Preferably, one can use a parallel random number generator, such as the Scalable Parallel Random Number Generators (SPRNG) Library (http://sprng.cs.fsu.edu/ webcite).
Running multiple chains for Bayesian LASSO modeling
In this example, we show how to parallelize multiple chains for the Bayesian LASSO (Least Absolute Shrinkage and Selection Operator) regression model [12,27]. In the wholegenome prediction context, consider the following linear model:
where y is a vector of phenotypes, μ is an effect common to all observations, β is a vector of unknown marker effects, X is an incidence matrix, and e is a vector of residuals. The prior specification follows de los Campos et al. [12] but without the terms for additive genetic effects. An R package, BLR, was used to implement the Bayesian LASSO model [28].
The dataset used here consisted of 147 Angus cattle, each genotyped for 37 892 polymorphic SNP (Single Nucleotide Polymorphisms) markers and with estimated breeding values (EBV) for marbling score as response variable. In addition to running a single long chain of 100 000 iterations (after a burnin of 1000 iterations), we also ran 10 chains, each consisting of 10 000 iterations (after a burnin of 1000 iterations). All jobs were submitted and run on a Condor cluster at the University of Wisconsin – Madison [29]. This cluster provides 1860 cores for distributed parallel computing. Among them, 1847 run a Linux operation system and the remaining a Windows operation system. Memory size ranges from 256M to 214G: 7.04% (<1G), 67.58% (13G), 22.69% (48G), and 2.67% (>10G). A Perl script was used, that installs the R system and required libraries (such as the SuppDists package) onto remote nodes prior to the computing and then executes the Bayesian LASSO program. This Perl script served as the executable in the Condor job batch file.
For the multiple chains approach, each started with overdispersed initial values
(and with different seeds for the random number generator), and the chains converged
after a certain number of iterations. Markov chain Monte Carlo convergence was examined
using posterior samples of the residual variance collected from the first 4500 iterations
of each chain. Trace plots of posterior samples of the residual variance showed that
most chains tended to stabilize after 1000 iterations, and all approached 0.0034 (Figure
2a), which corresponds well to estimated residual variance in this example. The trace
plot of the shrink factor
Figure 2. Markov chain Monte Carlo convergence diagnosis. (a) Trace plots of posterior samples of residual variance obtained within 4500 iterations
from each of the 10 chains; (b) trace plot of shrink factor
The parallel computing took between 141 and 178 min to complete each of the 10 processes. The differences were due to varying CPU speeds and workloads on these computer nodes. In contrast, running a single chain with 100 000 iterations (after a burnin of 1000 iterations) on a Linux workstation with similar specifications took 1386 min (Figure 3a). Thus, the reduction in runtime from parallel computing was approximately 7.78 fold. Posterior estimates of the model parameters were similar between the two computing approaches (data not presented).
Figure 3. Parallel vs. sequential computing of a Bayesian LASSO model for genomeenabled prediction of genetic merit. (a) Comparison of computing time; (b) expected speedup by parallel computing with the chain length equal to ten times the burnin length for the Markov chain.
When running multiple chains, the reduction in runtime is limited by the time required for burnin. Let b denote the number of burnin iterations that is required, and n be the number of iterations after the burnin. Then, in a serial implementation, the chain will consist of b + n iterations in total. Let there be K processes running chains in parallel, each taking on an equal length of Markov chain (i.e., b + n/K iterations). Assuming each iteration takes the same amount of processing time, the reduction in runtime is given by:
The above is an alternative form of (10) with
Practically, burnin time is related to the mixing rate of the chain, which is related
to the total length of the Markov chain. Let n = 10b, which is a useful ruleofthumb in most practical situations. Then, equation (15)
depends only on parameter K. Hence, we have
Thus, when running multiple chains, each with a significant length of burnin, the
speedup does not scale well with the number of available CPU cores. Let
Parallel BayesCpC for wholegenome prediction
Finally, we show a real application of parallel computing on genomeenabled prediction
in beef cattle. The computing was implemented with a highthroughput computing pipeline
called parallelBayesCpC [30]. This is a highthroughput computing package and a member of the WGSE (WholeGenomeenabled
Selection and Evaluation) family [31] of distributed highthroughput computing pipelines. In computing, a pipeline is a
set of data processing elements connected in series (i.e., the output of one element
is the input of the next one) and the elements of a pipeline can be executed in parallel
or sequentially. Typically, pipelining increases the computing throughput. In the
context of wholegenome prediction, the pipeline that we have developed can automate
all steps involved in the computing and decision making for wholegenome prediction,
which includes and is not limited to data input and quality control, model feature
selection (FS) if applicable, postFS statistical inference and crossvalidation (CV),
and output and documentations (Figure 4a). The parallelBayesCpC package is so named because it uses the BayesCπ model for
FS and the BayesC
Figure 4. Whole genomeenabled selection and evaluation (WGSE) pipelines and parallel BayesCpC. (a) Graphic illustration of the WGSE pipelining; (b) workflow; (c) Condor SOAR webpage of the parallel BayesCpC pipeline.
For simplicity of illustration, consider a linear model with only the overall mean and marker effects:
where
Here,
With a subset of, say
Typically, Kfold CV is often used to evaluate predictive models, in which the whole dataset is divided into K portions of approximately equal size. The model is then trained in the set of K1 portions of the data and predicted in the remaining one portion. Portioning of training and testing sets is then rotated K times in each CV experiment. Furthermore, each CV experiment can be randomly replicated a number of times in order to increase the stability of model evaluation (but we did not do that in the present study).
As an example, we used the parallel BayesCpC package to select the optimal number of SNP to predict rib eye area in a beef cattle population. The data consisted of 2919 animals each with estimated breeding values for rib eye area and genotypes obtained from the Illumina 50K SNP Beadchip. After data editing and data quality control, 46 723 polymorphic SNP markers were used. The optimal panel search started with the top 50 SNP markers according to the posterior model probability for having a nonzero effect for this marker, and then increased from the top 100 to the top 3000 markers at an increment of 100 markers each time. We did not exhaust all possible panels beyond 3K because the prediction accuracy showed a constant trend of decrease after the panel reached 1000 SNP.
In the parallelBayesCpC package, MPI is used for data quality control and parallel MCMC is employed by both FS and CV. In our example, distributed jobs were submitted to a local Condor pool with 128 cores. These are Linux workstations, with Intel® Xeon® CPU (2.67 GHz), > 8G memory per slot and a cache size of 12 288 KB. The submit machine is also a Linux workstation with Intel® Xeon® CPU (3.00 GHz), 16G memory and 6144 KB cache size. Each parallel MCMC chain for feature selection consisted of 10 000 iterations after a burnin of 2000 iterations, thinned every onetenth. Each parallel MCMC for CV consisted of 25 000 iterations, after a burnin of 5000 iterations, thinned every onetenth.
We examined three computing strategies to search for an optimal panel size for predicting genetic merit using SNP markers. The first strategy executed 30 metajobs in parallel, each consisting of one round of parallel FS jobs using all SNP markers and one round of parallel postFS inference and CV for a specific panel (X = 50, 100, 200, …, 3000, respectively). In the second strategy, one round of parallel FS jobs was executed, followed by 30 rounds of parallel postFS inference and CV jobs conducted sequentially for all panel sizes. Parallel CV jobs for the 500SNP panel started only when parallel CV jobs for the 400SNP jobs had finished. The third strategy consisted of 30 metajobs executed in series, with each metajob consisting of one round of parallel FS jobs using all SNP markers and one round of parallel postFS inference and CV for a specific panel. The difference between the second and the third strategy is that different optimal panel sizes were selected based on the same set of FS results with the second computing strategy but each panel was selected based on a different set of FS results with the third computing strategy.
There were very significant differences in computing time between the three computing strategies (Figure 5a). The first strategy consumed the least time (mostly completed within 12 h), because it used more features of parallel computing, but it also required far more slots for computing (in total 90 slots were needed). The second strategy avoided the use of many slots because it ran only one round of parallel FS jobs and then executed 30 rounds of parallel postFS inference and CV jobs sequentially based on the same set of FS results. We ran three parallel chains for FS and also used 3fold CV to evaluate prediction accuracy and, hence, only three slots were needed for this strategy. Because CV jobs on a subset of markers typically ran much faster than a FS job on all markers, it was computationally efficient to run these CV jobs sequentially. The computing time necessary for the second strategy was approximately two times greater than for the first strategy. However, the third strategy consumed the greatest amount of time (over two weeks). With this strategy, jobs for different panel sizes were executed in series but FS jobs and postFS inference and CV jobs for each panel size were executed in parallel. If all these jobs were executed sequentially, the computing time necessary would exceed one month and half, and this is definitely not optimal. Comparatively, the first computing strategy was twice as fast as the second strategy and 29 times faster than the third strategy.
Figure 5. Computing time by three parallel computing strategies (a) and prediction correlations by the first strategy (b). Three parallel computing strategies were used in search of optimal SNP panel sizes for predicting genetic merit. The first strategy executed 30 metajobs in parallel, each consisting of one round of parallel feature selection (FS) jobs using all SNP markers and one round of parallel postFS inference and CV for a specific panel (X = 50, 100, 200, …, 3000, respectively); In the second strategy, one round of parallel FS jobs was executed, followed by 30 rounds of parallel postFS inference and CV jobs conducted sequentially for all panel sizes (P1_A); The third strategy consisted of 30 metajobs executed in series, with each metajob consisting of one round of parallel FS jobs using all SNP markers and one round of parallel postFS inference and CV for a specific panel (Pn_A).
Despite the differences in computing times, predictions obtained with the three computing scenarios were highly comparable. The correlation between estimated breeding values for rib eye area and their fitted values in the training set (referred to as fitting accuracy hereafter) increased almost monotonically with panel size, until it plateaued with a panel size of around 2000 SNP (Figure 5b). However, the correlation between the estimated breeding values and their predicted values in the testing set (referred to as predictive accuracy hereafter) reached its peak (0.8886) with a panel size of 1000 SNP, and then went down slightly. The highest predictive accuracy was observed with 500 to 1500 selected markers. The decrease in predictive accuracy with > 1000 SNP possibly reflected overfitting, which, in this case, occurred much before the panel size exceeded the training population size (i.e., around 2000 animals). Hence, with Bayesian regression models, prediction using more SNP may not necessarily give better results than prediction using a smaller panel. A model that describes the training set well does not necessarily yield the best predictions when generalized to the population. This is referred to as poor generalization in machine learning [33]. The fitting and prediction accuracies are illustrated in Figure 6 for various panel sizes.
Figure 6. Prediction accuracies obtained from different panel sizes. (a) 100 SNP; (b) 1000 SNP; (c) 2000 SNP; (d) 3000 SNP.
In the Bayes CpC procedure, the BayesC π model postulates that a portion π of all SNP have zero effect. In a highdensity SNP panel, π is typically expected to be large, meaning that the portion of “signal” SNP, 1
Figure 7. Histogram of posterior estimates of 1π in the BayesCπ model for feature selection. The results were obtained from feature selection using computing strategy II, that is, one round of parallel FS jobs was then executed, followed by 30 rounds of parallel postFS inference and CV jobs conducted sequentially for all panel sizes.
With the Bayesian regression models explored here, feature selection may be important since a model with all SNP does not necessarily give the best predictions. This situation is unlike ridge regression best linear unbiased prediction [34] or prediction using the GBLUP method [35], for which a model with all markers would typically be favored. While selecting models of varying dimensions may be an issue to explore, it brings tremendous challenges to computing, particularly when the dataset is large. In this regard, highperformance computing offers a markedly competitive edge, not only in reducing computing time but also in tuning optimal models for wholegenome prediction.
Discussion
To date, almost all statistical software packages for animal breeding and genetics have been developed for serial computation. In such programs, only one instruction is executed at a time and after that instruction is finished, the next instruction begins. Hence, serial computing performance depends heavily on the speed (clock frequency) of CPU, and the runtime of a program is approximately equal to the number of instructions multiplied by the average time per instruction. Keeping everything else constant, higher clock frequency leads to faster computing speed and thus decreased runtime for all computationbounded programs [36]. This was the situation with the performance enhancement of microprocessors based on a single CPU from the 1980s to the early 2000s. However, rate of improvement has slowed down since 2003 due to hardware limitations incurred by energy consumption and heat dissipation. On the other hand, parallel computing has gained impetus as a result of increasingly available multiplecore computers, computer clusters, and networking and has been referred to as the concurrency revolution [37]. In theory, multiple threads of execution can cooperate to complete the work faster than a serial setting.
Parallel computing uses multiple processing elements concurrently to solve a problem.
To implement parallel computing, one first needs to break the problem into discrete
“chunks” of work, so that they can be distributed to run on multiple processors. This
is known as task decomposition or partitioning. The next fundamental step in designing
the parallel algorithm involves identifying independencies that are assumed explicitly
in the model. Without loss of generality, let P_{i} and P_{j} be two program jobs, where i < j indexes the order of execution. Bernstein’s conditions [38] can be used to identify whether or not the two jobs are independent and can be executed
in parallel [39]. Let I_{i} and O_{i} be the input variables and output variables, respectively, of P_{i}. Likewise, the same definitions hold for I_{j} and O_{j} of P_{j} . Then, P_{i} and P_{j} are independent if they satisfy (1)
MCMC algorithms have revolutionized the application of Bayesian inference, because it tackles a large range of complex inferential problems that were previously not considered possible, tractable. In the meantime, statisticians are becoming ever more ambitious in the range (complexity) of models they consider and MCMC algorithms for large complex models often require enormous amounts of computing power. Consequently, effective exploitation of parallel algorithms is of high relevance to Bayesian computation. A difficulty, however, is that MCMC algorithms are serial by nature and do not easily migrate onto a parallel system. Nevertheless, various strategies can be used to design parallel MCMC methods. The key is to identify steps with data independence or conditional independence on which parallelism can reside. Several algorithms or strategies exist for running parallel MCMC, which is straightforward and it requires minimal interprocess communication. However, poorly mixing MCMC algorithms with long burnin periods are not ideally suited to this situation, because a long period of burnin must be repeated on every available CPU core. Thus, it is often desirable to explore strategies for parallelism within single chains.
The actual performance of parallel MCMC depends on several issues. Among them, interprocess communication is a primary factor. Many parallel applications require processes to share data with each other, which is known as intertask communication and implies overhead. Intertask communication can offset the gain in computing speed from parallel computing, because it frequently requires some type of synchronization between tasks, causing processes to spend time “waiting”. In the worst case, competing communication traffic can saturate the available network bandwidth, leading to poor parallel computing performance. Thus, there is always a need to balance the distribution of work among tasks. There is no simple rule for this type of load balancing. Ideally, all tasks are to be kept busy all of the time, so that task idle time is minimized [9,10].
The ratio of computation to communication is qualitatively measured by the concept of granularity (https://computing.llnl.gov/tutorials/parallel_comp/ webcite). On one hand, a low computation to communication ratio (finegrain parallelism) facilitates load balancing, as relatively small amounts of computational work are done between communication events but this can imply high communication overhead and less opportunity for performance enhancement, because communication and synchronization between tasks may take longer than the computation. On the other hand, a high computation to communication ratio (coarsegrain parallelism) allows more opportunity for performance enhancement; because relatively large amounts of computational work are done between communication and synchronization events. However, it is difficult to balance loads efficiently with coarsegrain parallelism and computing time may differ dramatically between computer cores. Therefore, there is a tradeoff between computing and communication and the optimal granularity depends on the problem at hand. In most parallel MCMC problems, it is advantageous to have coarse granularity because the overhead associated with communication and synchronization is high relative to execution speed, but finegrain parallelism can sometimes help reduce overhead due to load imbalance.
Conclusions
In this paper, we have shown the principles and examples of parallel MCMC, with applications to wholegenome prediction of breeding values. Parallel computing operates on the principle that a large problem can be split into smaller components and solved concurrently (i.e., “in parallel”), each on a separate processor (or CPU core). In the context of parallel MCMC, two basic algorithms exist: running multiple chains and parallelism within a single chain, yet some variants can be useful as well. In principle, all Bayesian models can be parallelized in computing but the associated algorithms and strategies may differ, leading to varied computing efficiencies. Although many technical details have yet to be explored, we expect that the use of parallel MCMC methods will revolutionize computational tools for research and breeding programs for animals in the postgenome era.
Appendix
(a) Example C code using MPI for parallel simulation of a singleparameter normal model with unknown mean and known variance
/* This is an example C program using MPI for parallel computing of *
* a singleparameter normal model with unkown mean and known *
* variance. For illustration purpose, the step for data input is omitted. *
* Instead, the sample mean and standard deviation, as well as the prior *
* mean and standard are used. *
* Contact: XL Wu, nick.wu@ansci.wisc.edu; UWMadison, 09122011 */
#include < stdio.h>
#include < math.h>
#include < mpi.h>
#include “ranNum.h”
main(int argc, char **argv)
{int proc_id, root_process, nprocs, ierr, niters, i;
double xi, xi2, sum, psum, sum_xi2, psum_xi2;
double tar0, tar1, sdn, tarn, varn, mun, mumu, sdmu;
MPI_Status status;
/* Prior mean and standard deviation */
double mu0 = 4.000;
double sd0 = 1.000;
/* sample size, mean and variance */
int nind = 7670;
double mu1 = 3.394;
double sd1 = 0.580;
/* compute posterior statistics */
tar0 = 1.0/(sd0*sd0);
tar1 = (1.0*nind)/(sd1*sd1);
varn = 1.0/(tar0 + tar1);
sdn = sqrt(varn);
mun = varn * (tar0*mu0 + tar1*mu1);
/* Parallel simulation begins, starting from here */
/* process 0 as the root process. */
root_process = 0;
/* Replicate this process to create parallel processes. */
ierr = MPI_Init(&argc, &argv);
/* Allocate memory for random seed variable*/
long* idum;
MPI_Alloc_mem(sizeof(long), MPI_INFO_NULL, &idum);
/* Find out process ID and number of participating processes. */
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &proc_id);
ierr = MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
/* Root process gets the number of simulation steps */
if(proc_id == root_process) {
printf(“Please enter the number of simulation steps: ”);
scanf(“%i”, &niters);}
/* Broadcast the number of simulation steps to all participating processes */
ierr = MPI_Bcast(&niters, 1, MPI_INT, root_process,MPI_COMM_WORLD);
/*process id as the random number seed */
*idum = proc_id;
/* Each process computes a partial sum of simulated values */
psum = 0;
psum_xi2 = 0;
for(i = proc_id + 1; i < niters + 1; i + = nprocs) {
xi = mun + sdn * randomnormal(idum);
xi2 = xi * xi;
psum = psum + xi;
psum_xi2 = psum_xi2 + xi2;}
printf(“proc %i computes: %f\n”, proc_id, (float)psum);
/* Do a reduction in which all partial sums are combined into the grand sum */
ierr = MPI_Reduce(&psum, &sum, 1, MPI_DOUBLE,
MPI_SUM, root_process, MPI_COMM_WORLD);
ierr = MPI_Reduce(&psum_xi2, &sum_xi2, 1, MPI_DOUBLE,
MPI_SUM, root_process, MPI_COMM_WORLD);
/* Finally, the root process prints posterior mean and standard error of mu. */
if(proc_id == root_process) {
mumu = sum / niters;
sdmu = sqrt((sum_xi2  niters*mumu*mumu)/(niters1));
printf(“The posterior mean and variance of mu is %f and %f\n”, (float)mumu,(float)sdmu);}
/* Close down this processes. */
ierr = MPI_Finalize();}
(b) Condor job batch files (Note: these Condor scripts were written automatically be a R scripts based on a input parameter file. Using the BayesCpC package, a user does not need to write this type of Condor job batch files.)
# Condor DAGMan job
JOB input job_InputData
JOB selection job_Selection
SCRIPT POST selection /usr/local/wgse_beta/V2/cmd_postscript_SelectionSummary
JOB validation job_Validation
SCRIPT POST validation /usr/local/wgse_beta/V2/cmd_postscript_OutputResults
PARENT input CHILD selection
PARENT selection CHILD validation
# Data input and quality control
Universe = vanilla
Executable = /usr/local/wgse_beta/V2/rungeneric.pl
Arguments = ‐‐new ‐‐type = R ‐‐tarball = builtsl5R2.10.1.tar.gz ‐‐installfrom = R2.10.1 ‐‐cmdtorun = p1_data_input_n_processing.R ‐‐unique = input
Log = step_input.log
Output = step_input.out
Error = step_input.error
notification = NEVER
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files =(data path and files are omitted), /home/nickwu/BayesCpC/V2_UW/data/datafile.csv, /usr/local/wgse_beta/V2/rungeneric.pl, /usr/local/wgse_beta/V2/SLIBS.tar.gz, /usr/local/wgse_beta/V2/cmd_data_input_n_processing, /usr/local/wgse_beta/V2/p1_data_input_n_processing.R,
Queue
# Feature selection
Universe = vanilla
Executable = /usr/local/wgse_beta/V2/rungeneric.pl
Arguments = ‐‐new ‐‐type = R ‐‐tarball = builtsl5R2.10.1.tar.gz ‐‐installfrom = R2.10.1 ‐‐cmdtorun = p4_BayesCpC_validation.R ‐‐unique = validation_1 1
Log = step_validation.$(process).log
Output = step_validation.$(process).out
Error = step_validation.$(process).error
notification = NEVER
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = .linkPar, /usr/local/wgse_beta/V2/rungeneric.pl, /usr/local/wgse_beta/V2/SLIBS.tar.gz, /usr/local/wgse_beta/V2/cmd_cross_validation, /usr/local/wgse_beta/V2/p4_BayesCpC_validation.R,
Queue
Arguments = ‐‐new ‐‐type = R ‐‐tarball = builtsl5R2.10.1.tar.gz ‐‐installfrom = R2.10.1 ‐‐cmdtorun = p4_BayesCpC_validation.R ‐‐unique = validation_2 2
Queue
Arguments = ‐‐new ‐‐type = R ‐‐tarball = builtsl5R2.10.1.tar.gz ‐‐installfrom = R2.10.1 ‐‐cmdtorun = p4_BayesCpC_validation.R ‐‐unique = validation_3 3
Queue
# PostFS inference and crossvalidation
Universe = vanilla
Executable = /usr/local/wgse_beta/V2/rungeneric.pl
Arguments = ‐‐new ‐‐type = R ‐‐tarball = builtsl5R2.10.1.tar.gz ‐‐installfrom = R2.10.1 ‐‐cmdtorun = p4_BayesCpC_validation.R ‐‐unique = validation_1 1
Log = step_validation.$(process).log
Output = step_validation.$(process).out
Error = step_validation.$(process).error
notification = NEVER
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = .linkPar, /usr/local/wgse_beta/V2/rungeneric.pl, /usr/local/wgse_beta/V2/SLIBS.tar.gz, /usr/local/wgse_beta/V2/cmd_cross_validation, /usr/local/wgse_beta/V2/p4_BayesCpC_validation.R,
Queue
Arguments = ‐‐new ‐‐type = R ‐‐tarball = builtsl5R2.10.1.tar.gz ‐‐installfrom = R2.10.1 ‐‐cmdtorun = p4_BayesCpC_validation.R ‐‐unique = validation_2 2
Queue
Arguments = ‐‐new ‐‐type = R ‐‐tarball = builtsl5R2.10.1.tar.gz ‐‐installfrom = R2.10.1 ‐‐cmdtorun = p4_BayesCpC_validation.R ‐‐unique = validation_3 3
Queue
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
XLW worked out the conception and design of this study, developed the pipeline, analyzed the results and wrote the article. CS and TMB helped with programming, data analysis and proofread the article. KAW, GJMR, NDLG, and DG were involved in the design of this study, discussion of the results, and proofreading of this article. All authors read and approved the final manuscript.
Acknowledgements
This research was supported by the University of Wisconsin (UW) Foundation, a genomic selection grant by Merial Ltd., and National Research Initiative competitive grant no. 20093520505099 from the USDA National Institute for Food and Agriculture Animal Genome Program (Washington, DC, USA). Drs. Stewart Bauck and Brent Woodwart were partially involved in this study. Dr. Jeremy F. Taylor is acknowledged for kindly providing the beef cattle data used in this study. William Taylor at the UW Center for HighThroughput Computing (CHTC) is thanked for his excellent technical assistance in running multiple chains for Bayesian LASSO on the CHTC Condor pool and for helping design the SOAR interface for the BayesCPC. Special thanks go to the two anonymous reviewers and the editors (Drs. Helene Hayes and Jack Dekkers) whose edits and suggestions have improved this paper a lot. In particular, Dr. Jack Dekkers is acknowledged who has thoroughly and critically revised the whole manuscript; his inputs are not only editorial but highly technical as well. KW acknowledges financial support from the National Association of Animal Breeders (Columbia, MO).
References

Meuwissen THE, Hayes BJ, Goddard ME: Prediction of total genetic value using genomewide dense marker maps.
Genetics 2001, 157:18191829. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wu XL, Beissinger TM, Bauck S, Woodward B, Rosa GJM, Weigel KA, de Leon Gatti N, Gianola D: A primer on highthroughput computing for genomic selection.
Front Genet 2011, 2:4. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wilkinson DJ: Parallel Bayesian computation. In Handbook of Parallel Computing and Statistics. Edited by Kontoghiorghes EJ. Chapman and Hall/CRC, Boca Raton, FL, USA; 2005:477508.

Rosenthal JS: Parallel computing and Monte Carlo algorithms.

Almasi GS, Gottlieb A: Highly Parallel Computing. BenjaminCummings publishers, Redwood City; 1989.

Gropp W, Lusk E, Skjellum A: Using MPI: Portable Parallel Programming with the Message Passing Interface. 2nd edition. The MIT Press, Cambridge; 1999.

Benson DA, KarschMizrachi I, Lipman DJ, Ostell J, Sayers EW: GenBank.
Nucleic Acids Res 2011, 39:D32D37.
Database issue
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Asanovic K, Bodik R, Catanzaro BC, Gebis JJ, Husbands P, Keutzer K, Patterson DA, Plishker WL, Shalf J, Williams SW, Yelick KA, University of California at Berkeley: The landscape of parallel computing research: A view from Berkeley.
Technical Report No. UCB/EECS‐‐ 2006183 2006.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS2006183.pdf webcite

Brockwell AE: Parallel Markov chain Monte Carlo simulation by prefetching.
J Comput Graph Stat 2006, 15:246261. Publisher Full Text

Parallel Markov chain Monte Carlo computation for varying dimension signal analysis. 2009, 26732677. [Proceedings of the 17th European Signal Processing Conference: 2428 August 2009; Glasgow]

Habier D, Fernando RL, Dekkers JCM: Genomic selection using lowdensity marker panels.
Genetics 2009, 182:343353. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

de los Campos G, Naya H, Gianola D, Crossa J, Legarra A, Manfredi E, Weigel K, Cotes JM: Predicting quantitative traits with regression models for dense molecular markers and pedigree.
Genetics 2009, 182:375385. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhong S, Dekkers JCM, Fernando RL, Jannink JL: Factors affecting accuracy from genomic selection in populations derived from multiple inbred lines: a barley case study.
Genetics 2009, 182:355364. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Fishman GS: Monte Carlo: Concepts, Algorithms, and Applications. Springer, New York; 1995.

Gelman A, Carlin JB, Stern HS, Rubin DB: Bayesian Data Analysis. Chapman and Hall, New York; 2004.

Gilks WR, Roberts GO, Sahu SK: Adaptive Markov chain Monte Carlo through regeneration.
J Am Stat Assoc 1998, 93:10451054. Publisher Full Text

Gelman A, Rubin DB: Inference from iterative simulation using multiple simulations.
Stat Sci 1992, 7:457511. Publisher Full Text

Bradford R, Thomas A: Markov chain Monte Carlo methods for family trees using parallel processor.
Stat Comput 1996, 6:6775. Publisher Full Text

Geyer CJ: Markov chain Monte Carlo maximum likelihood.
In Proceedings of the 23rd Symposium on the Interface: Computing Science and Statistics: 2124 April 1991; Seattle Edited by Keramidas E. 1991, 156163.

Amdahl GM: Validity of the single processor approach to achieving largescale computing capabilities.
In Proceedings of the American Federation of Information Processing Societies: 1416 November; Anaheim 1967, 30:483485.

Geman S, Geman D: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images.
IEEE Trans Pattern Anal Mach Intell 1984, 6:721741. PubMed Abstract

Andrieu C, Thoms J: A tutorial on adaptive MCMC.
Stat Comput 2008, 18:343373. Publisher Full Text

Marinari E, Parisi G: Simulated tempering: a new Monte Carlo scheme.
Europhys Lett 1992, 19:451458. Publisher Full Text

Neal RM: Sampling from multimodal distributions using tempered transitions.
Stat Comput 1996, 6:353366. Publisher Full Text

Craiu RV, Rosenthal JS, Yang C: Learn from thy neighbor: Parallelchain and regional adaptive MCMC.
J Am Stat Assoc 2009, 104:14541466. Publisher Full Text

Distributed Markov chain Monte Carlo. [Proceedings of Neural Information Processing Systems Workshop on Learning on Cores, Clusters and Clouds: 11 December 2010; Mt. Currie South]
http://lccc.eecs.berkeley.edu/papers.html webcite

Park T, Casella G: The Bayesian Lasso.
J Am Stat Assoc 2008, 103:681686. Publisher Full Text

Perez P, Delos Campos G, Crossa J, Gianola D: Genomicenabled prediction based on molecular markers and pedigree using the Bayesian linear regression package in R.
Plant Genome 2010, 3:106116. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Thain D, Tannenbaum T, Livny M: Distributed computing in practice: the Condor experience.
Concurrency Computat: Pract Exper 2005, 17:323356. Publisher Full Text

Wu XL, Yao C, Long N, Stewart B, Woodward B, Mujibi DFN, Rosa GJ, Weigel KA, Gianola D:
Highthroughput computing for genomeenabled selection – Preliminary deployment of a HTC pipeline for postgenomeera breeding programs. [Proceedings of the Plant and Animal Genome XIX Conference: 1519 January 2011; San Diego]
http://www.intlpag.org/2013/index.php/abstracts/abstractsarchive webcite

Wu XL, Hayrettin O, Duan H, Beissinger T, Bauck S, Woodward B, Rosa GJM, Weigel KA, de Leon Gatti N, Taylor J, Gianola D:
ParallelBayesCpC on OSG: Gridenabled Highthroughput computing for genomic selection in practice. [Proceedings of the Plant and Animal Genome XX Conference: 1418 January 2012; San Diego]
https://pag.confex.com/pag/xx/webprogram/Paper4104.html webcite

Gianola D, Sorensen D: Quantitative genetic models for describing simultaneous and recursive relationships between phenotypes.
Genetics 2004, 167:14071424. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bishop CM: Pattern Recognition and Machine Learning. Springer, New York; 2006.

Piepho HP: Ridge regression and extensions for genomewide selection in maize.
Crop Sci 2009, 49:11651176. Publisher Full Text

Luan T, Woolliams JA, Lien S, Kent M, Svendsen M, Meewissen THE: The accuracy of genomic selection in Norwegian red cattle assessed by crossvalidation.
Genetics 2009, 183:11191126. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hennessy JL, Patterson DA: Computer Architecture: A Quantitative Approach. 3rd edition. Morgan Kaufmann Publishers, San Francisco; 2002.

Roosta SH: Parallel Processing and Parallel Algorithms: Theory and Computation. Springer, New York; 2000.