Keywords:

  • kernel methods;
  • nonparametric methods;
  • parallel machine learning;
  • GPGPU;
  • parallel multidimensional trees;
  • CUDA

Abstract

Kernel summations are a ubiquitous key computational bottleneck in many data analysis methods. In this paper, we attempt to marry, for the first time, the best relevant techniques in parallel computing, where kernel summations are in low dimensions, with the best general-dimension algorithms from the machine learning literature. We provide the first distributed implementation of a kernel summation framework that can utilize: (i) various types of deterministic and probabilistic approximations that may be suitable for low and high-dimensional problems with a large number of data points; (ii) any multidimensional binary tree using both distributed memory and shared memory parallelism; and (iii) a dynamic load balancing scheme to adjust work imbalances during the computation. Our hybrid message passing interface (MPI)/OpenMP codebase has wide applicability in providing a general framework to accelerate the computation of many popular machine learning methods. Our experiments show scalability results for kernel density estimation on a synthetic ten-dimensional dataset containing over one billion points and on a subset of the Sloan Digital Sky Survey data, using up to 6144 cores. © 2013 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2013


1. INTRODUCTION

Kernel summations occur ubiquitously in both old and new machine learning algorithms, including kernel density estimation [1], kernel regression [2], Gaussian process regression [3], kernel PCA [4], and kernel support vector machines (KSVM) [5]. In these methods, we are given a set of reference/training points R = {r_1, …, r_{|R|}} ⊂ ℝ^D with associated weights {w_r : r ∈ R}, and a set of query/test points Q = {q_1, …, q_{|Q|}} ⊂ ℝ^D (analogous to the source points and the target points in the FMM literature). We consider the problem of rapidly evaluating, for each q ∈ Q, sums of the form:

  f(q; R) = Σ_{r ∈ R} w_r k(q, r)    (1)

where k(·,·) is the given kernel.

In this paper, we consider the setting of evaluating f(q;R) on a distributed set of training/test points. Data may be distributed because: (i) it is more cost-effective to distribute data across a network of less powerful nodes than to store everything on one powerful node and (ii) it allows distributed query processing for high scalability. Each process (which may or may not be on the same node) owns a subset of R and Q and needs to initiate communication (i.e., MPI, memory-mapped files) when it needs a remote piece of data owned by another process. Cross-validation in all of the methods above requires evaluating Eq. (1) for multiple parameter values, yielding 𝒪(D|Q||R|) cost. In particular, |Q| and |R| can be so prohibitively large that one CPU cannot handle the computation in a tractable amount of time. Unlike the usual 3-D setting in N-body simulations, D may be as high as 1000 in many kernel methods. This paper attempts to provide a general framework that encompasses acceleration techniques for a wide range of both low-dimensional and high-dimensional problems with a large number of data points.
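As a point of reference, a minimal serial evaluation of Eq. (1) is sketched below; the container types and the kernel functor are illustrative choices, not part of the framework's actual interface.

```cpp
// Naive O(D|Q||R|) evaluation of Eq. (1): f(q; R) = sum_{r in R} w_r k(q, r).
// Q and R are stored as vectors of D-dimensional points; `kernel` is any
// callable taking two points and returning k(q, r).
#include <cstddef>
#include <vector>

template <typename Kernel>
std::vector<double> NaiveKernelSum(const std::vector<std::vector<double>>& Q,
                                   const std::vector<std::vector<double>>& R,
                                   const std::vector<double>& weights,
                                   const Kernel& kernel) {
  std::vector<double> f(Q.size(), 0.0);
  for (std::size_t i = 0; i < Q.size(); ++i)
    for (std::size_t j = 0; j < R.size(); ++j)
      f[i] += weights[j] * kernel(Q[i], R[j]);  // accumulate w_r k(q, r)
  return f;
}
```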

Shared/distributed memory parallelism. Achieving scalability in a distributed setting requires: (i) minimizing the inherently serial portions of the algorithm (Amdahl's law); (ii) minimizing the time spent in critical sections; and (iii) overlapping communication and computation as much as possible. To achieve this goal, we utilize OpenMP for shared-memory parallelism and MPI for distributed-memory parallelism in a hybrid MPI/OpenMP framework. Kernel summation can be parallelized because each f(q;R) can be computed independently. In practice, Q is partitioned into pairwise disjoint subsets Q_1, …, Q_T, and the batch sums for each Q_i proceed in parallel. We use a query subtree as each Q_i (Fig. 6), since the spatial proximity of its points makes it more efficient to process them as a group.
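A minimal sketch of this shared-memory strategy is shown below, assuming the naive double loop above and flat vectors of points rather than the framework's tree nodes; the chunk size and scheduling policy are illustrative.

```cpp
// Each OpenMP thread handles a disjoint chunk of queries (standing in for a
// query subtree Q_i); the reference data R is shared read-only across threads.
#include <cstddef>
#include <vector>
#include <omp.h>

template <typename Kernel>
void ParallelKernelSum(const std::vector<std::vector<double>>& Q,
                       const std::vector<std::vector<double>>& R,
                       const std::vector<double>& weights,
                       const Kernel& kernel, std::vector<double>& f) {
  f.assign(Q.size(), 0.0);
  // Dynamic scheduling mitigates imbalance when per-query work varies.
  #pragma omp parallel for schedule(dynamic, 64)
  for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(Q.size()); ++i)
    for (std::size_t j = 0; j < R.size(); ++j)
      f[i] += weights[j] * kernel(Q[i], R[j]);
}
```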

1.1. Our contributions

In this paper, we attempt to marry, for the first time, the best relevant techniques in parallel computing, where kernel summations are in low dimensions, with the best general-dimension algorithms from the machine learning literature. We provide a unified, efficient parallel kernel summation framework that can utilize: (i) various types of deterministic and probabilistic approximations (Table 2) that may be suitable for both low and high-dimensional problems with a large number of data points; (ii) any multidimensional binary tree using both distributed memory (MPI) and shared memory (OpenMP) parallelism (Table 3 lists some examples); and (iii) a dynamic load balancing scheme to adjust work imbalances during the computation. Our framework provides a general approach for accelerating the computation of many popular machine learning methods (see Table 1). Our motivation is similar to that of Liu and Jan Wu [15], where a general framework was developed to support various types of scientific simulations, and is based on parallelization of the dual-tree method [16].

Table 1. Methods that can be sped up by using our framework. Although the parts marked with × can be sped up in some cases by sparsifying the kernel matrix and applying Krylov-subspace methods, computed results are usually numerically unstable. PD and CPD denote positive-definite and conditionally positive-definite respectively.

  Method            | k(·,·)      | Train / Batch test
  KDE [1]/NWR [2]   | PDFs        | √ / √
  KSVM [5]/GPR [3]  | PD kernels  | × / √
  KPCA [4]          | CPD kernels | × / √
Table 2. Examples of approximation schemes that can be utilized in our framework.

  Approximation                  | Type          | Basis functions  | Applicability
  Series expansion [6,7]         | Deterministic | Taylor basis     | General
  Reduced set [5]                | Deterministic | Pseudo-particles | Low-rank PD/CPD kernels
  Monte Carlo [8,9]              | Probabilistic | None             | General smooth kernels
  Random feature extraction [10] | Probabilistic | Fourier basis    | Low-rank PD/CPD kernels
Table 3. Examples of multidimensional binary trees that can be utilized in our framework. If Rule(x) returns true, then x is assigned to the left child (as defined in ref. 14).

  Tree type         | Bound type                              | Rule(x)
  kd-trees [11]     | hyper-rectangle                         | x_i ≤ s_i for a chosen 1 ≤ i ≤ D with b_{i,min} ≤ s_i ≤ b_{i,max}
  metric trees [12] | hyper-sphere B(c, r)                    | ‖x − p_left‖ < ‖x − p_right‖ for pivots p_left, p_right
  vp-trees [13]     | B(c, r_1) ∩ B(c, r_2) for 0 ≤ r_1 < r_2 | ‖x − p‖ < t for a vantage point p and threshold t
  RP-trees [14]     | hyperplane aᵀx = b                      | xᵀv ≤ median(zᵀv : z ∈ S)
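To make Table 3 concrete, here is a hedged sketch of two Rule(x) predicates written as C++ functors; the member names and layout are illustrative, not the framework's actual API. A point goes to the left child when the rule returns true.

```cpp
#include <cstddef>
#include <vector>

// kd-tree rule: compare one coordinate against a split value chosen inside
// the node's bounding box along that dimension.
struct KdTreeRule {
  std::size_t split_dim;
  double split_value;
  bool operator()(const std::vector<double>& x) const {
    return x[split_dim] <= split_value;
  }
};

// Metric-tree rule: send the point to the side of the closer pivot.
struct MetricTreeRule {
  std::vector<double> pivot_left, pivot_right;
  bool operator()(const std::vector<double>& x) const {
    double d_left = 0.0, d_right = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
      d_left  += (x[i] - pivot_left[i])  * (x[i] - pivot_left[i]);
      d_right += (x[i] - pivot_right[i]) * (x[i] - pivot_right[i]);
    }
    return d_left < d_right;  // squared distances suffice for the comparison
  }
};
```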

Outline of this paper. In Section 3, we show how to exploit distributed/shared memory parallelism in building distributed multidimensional trees. In Section 4, we describe the overall algorithm and the parallelism involved. In Section 4.2, we describe how we exchange messages among different processes using the recursive doubling scheme; during this discussion, we touch briefly on the problem of distributed termination detection. In Section 4.3, we discuss our static and dynamic load balancing schemes. In Section 5, we demonstrate the scalability of our framework for kernel density estimation on both a synthetically generated dataset and a subset of the SDSS dataset [17]. In Section 6, we investigate the applicability of general-purpose computing on graphics processing units (GPGPU) to accelerating kernel summations. In Section 7, we discuss planned extensions.

Terminology. An MPI communicator connects a set of MPI processes, each of which is given a unique identifier called an MPI rank, in an ordered topology. We denote by C_world the MPI communicator over all MPI processes, and by D_P the portion of the data D owned by the P-th process. In this paper, we assume that: (i) the nodes are connected using a hypercube topology, as it is the most commonly used one; (ii) there are p_thread threads associated with each MPI process; (iii) the number of MPI processes p is a power of two, though our approach can easily be extended to arbitrary positive integers p; and (iv) the query set equals the reference set (Q = R; we denote by D the common dataset and by N = |D| its size), and D is equidistributed across all MPI processes. In particular, the monochromatic case Q = R occurs often when cross-validating for optimal parameters in many nonparametric methods.

2. RELATED WORK

2.1. Error Bounds

Many algorithms approximate the kernel sums at the expense of reduced precision. The following error bounding criteria are variously used in the literature:

DEFINITION 1: τ absolute error bound: For each f(q_i; R) with q_i ∈ Q, compute f̂(q_i; R) such that |f̂(q_i; R) − f(q_i; R)| ≤ τ.

DEFINITION 2: ε relative error bound: For each f(q_i; R) with q_i ∈ Q, compute f̂(q_i; R) such that |f̂(q_i; R) − f(q_i; R)| ≤ ε|f(q_i; R)|.

Bounding the relative error is much harder because the error bound criterion is in terms of the initially unknown exact quantity. As a result, many previous methods [18,19] have focused on bounding the absolute error. The relative error bound criterion is preferred to the absolute error bound criterion in statistical applications in which high accuracy is desired. Our framework can enforce the following error form:

DEFINITION 3: (1 − α) probabilistic ε relative/τ absolute error: For each f(q_i; R) with q_i ∈ Q, compute f̂(q_i; R) such that, with probability at least 1 − α, |f̂(q_i; R) − f(q_i; R)| ≤ ε|f(q_i; R)| + τ.
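As a small illustration (assuming the combined form |f̂ − f| ≤ ε|f| + τ given in Definition 3), the deterministic part of the acceptance test is a one-line predicate; the probabilistic (1 − α) part of the guarantee is the responsibility of the approximation scheme itself.

```cpp
// Returns true if the estimate satisfies the combined epsilon-relative /
// tau-absolute tolerance (deterministic part of Definition 3 only).
#include <cmath>

inline bool WithinTolerance(double f_estimate, double f_exact,
                            double epsilon, double tau) {
  return std::fabs(f_estimate - f_exact) <= epsilon * std::fabs(f_exact) + tau;
}
```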

2.2. Serial Approaches

Fast algorithms for evaluating Eq. (1) can be divided into two types: (i) reduced set methods from the physics/machine learning communities [5] and (ii) hierarchical methods which employ spatial partitioning structures such as octrees, kd-trees [11], and cover-trees [20].

Reduced set methods. Reduced set methods express the kernel sum in terms of a much smaller set of points (so-called dictionary points, each of which gives rise to a basis function b(·, r)):

  f(q; R) ≈ Σ_{r ∈ R_reduced} w̃_r b(q, r)

where |R_reduced| ≪ |R|, so the resulting kernel sum can be evaluated more quickly. In the physics community, uniform grid points are chosen and the points are projected onto Fourier bases (i.e., b(·,·) is the Fourier basis). Depending on how the particle-particle interactions are treated, a fast Fourier transform (FFT)-based summation method belongs to the category of particle-particle-particle mesh (P3M) methods or particle-mesh (PM) methods. However, these methods do not scale beyond three dimensions owing to the uniform grids. Recently, machine learning practitioners have employed variants of the reduced set method that exploit the positive-definiteness (or conditional positive-definiteness) of the kernel function and have successfully scaled many kernel methods such as SVM and GPR [21–24]. However, these methods require optimizing the basis points under a preselected error criterion (i.e., the reconstruction error in the reproducing kernel Hilbert space or the generalization error with/without regularization), and the resulting dictionary R_reduced can be quite large in some cases.

Hierarchical methods. Most hierarchical methods using trees utilize series expansions (Fig. 1). The pseudocode for a dual-tree method [16] that subsumes most hierarchical methods is shown in Algorithm 1. The first expansion, called the far-field expansion, summarizes the contribution of R_sub for a given query q:

  f(q; R_sub) = Σ_m M_m(R_sub) φ_m(q)

where the φ_m's (and the moments M_m) show dependence on the subset R_sub. The second type, called the local expansion, expresses the contribution of R_sub near q for q ∈ Q_sub ⊂ Q:

  f(q; R_sub) = Σ_m L_m(R_sub, Q_sub) ψ_m(q)
Figure 1. The reference points (the left tree) are hierarchically compressed and uncompressed when a pair of query (from the right tree)/reference nodes is approximated within an error tolerance. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Both representations are truncated at a finite number of terms depending on the prescribed level of accuracy, achieving 𝒪(|Q| log |R|) runtime in most cases. To achieve 𝒪(|Q| + |R|) runtime, we require an efficient linear operator that converts the M_m(R)'s into the L_m(R,Q)'s. For suitable basis representations of the φ's and ψ's, this far-to-local operator is diagonal and the translation is linear in the number of coefficients. There are many serial algorithms [16,18,25–29] that use different series expansion forms to bound the error deterministically. Holmes et al. [8] propose a probabilistic approximation scheme based on the central limit theorem, and Lee and Gray [9] use both deterministic and probabilistic approximations. In particular, probabilistic approximations can help overcome the curse of dimensionality at the expense of indeterminism in the approximated kernel sums.

In this paper, we focus on hierarchical methods because: (i) they provide a natural framework for controlling the approximation at varying degrees of resolution and (ii) specialized acceleration techniques for positive-definite kernels can be plugged in as special cases. We would like to point out that the code base can also be used in scientific N-body simulations [15], but we defer such applications to a future paper.

2.3. Parallelizations

Hierarchical N-body methods present an interesting challenge in parallelization: (i) both the data distribution and the work distribution are highly nonuniform across MPI processes and (ii) the computation often involves long-range communication owing to the kernel function k(·,·). In the worst case, every process will need almost every piece of data owned by the other processes. Here we discuss three important issues in a scalable distributed hierarchical N-body code:

Parallel tree building: Lashuk et al. [30] propose a novel distributed octree construction algorithm and a new reduction algorithm for evaluation to scale up to over 65K cores. Al-Furajh et al. [31] describe a parallel kd-tree construction on a distributed memory setting, while Choi et al. [32] work on a shared-memory setting. Liu et al. [33] discuss building spill-trees, a variant of metric trees that permit overlapping of data between two branches, using the map-reduce framework.

Load balancing: The most common static load balancing algorithms include: (i) costzones [34], which partition a pre-built query tree and assign each query particle to a zone (a related approach employs a graph partitioner [35]); and (ii) the ORB (orthogonal recursive bisection), which directly partitions each dimension of the space containing the query points in a cyclic fashion. Dynamic load balancing strategies [36] adjust imbalances between the work loads during the computation.

Interprocess communication: The local essential trees approach [37] (which involves a few large-grained communications) is a sender-initiated communication approach. Using the ORB, each process sends out the essential data that may be needed by the other processes using the recursive doubling scheme (Fig. 2). An alternative approach has the receiver initiate the communication; this approach involves many fine-grained communications and is preferable if interprocess communication overheads are small. For more details, see ref. 38.

Figure 2. Recursive doubling on the hypercube topology. Initially, each node begins with its own message (top left). The exchanges proceed in order: top right, bottom left, then bottom right. Note that the amount of data exchanged in each stage doubles.

3. DISTRIBUTED MULTIDIMENSIONAL TREE

Our approach for building a general-dimension distributed tree closely follows ref. 31. Following the ORB (orthogonal recursive bisection) in ref. 37, we define the global tree (Fig. 3), which is a hierarchical decomposition of the data points at the process level. The local tree of each process is built on its own local data D_P.

Figure 3. Each process owns the global tree of processes (the top part) and its own local tree (the bottom part). [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Building the distributed tree. Initially, all MPI processes in a common MPI communicator agree on a rule for partitioning their data into two parts (Fig. 1). The MPI communicator is then split in two depending on the MPI process rank. This process is repeated recursively until there are log p levels in the global tree (Fig. 4). Shared-memory parallelism can be exploited in the (independent) reduction step within each MPI process when generating the split rule (Fig. 1). Depending on the split rule and using C++ metaprogramming, we can auto-generate any binary tree (see Table 3) that is defined by an associative reduction operator for constructing its bounding primitives. Generalizing to multidimensional trees with an arbitrary number of children (such as cover-trees [20]) is left as future work.
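A minimal sketch of the global build is given below, assuming a power-of-two communicator and kd-tree-style hyper-rectangle bounds; ChooseSplitRule and ExchangePoints are placeholders for the parts this section does not detail, not the framework's actual interface.

```cpp
#include <mpi.h>
#include <algorithm>
#include <cstddef>
#include <vector>

// Recursively builds the log p global levels: reduce a bounding box over the
// communicator, split the data by an agreed rule, then split the communicator
// and recurse until each process is alone (it then builds its local tree).
void BuildGlobalTree(std::vector<double>& points /* row-major, |D_P| x D */,
                     int D, MPI_Comm comm) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  if (size == 1) return;  // base case: proceed with the local (shared-memory) build

  // Associative reduction for the bounding primitive (per-dimension min/max).
  std::vector<double> lo(D, 1e300), hi(D, -1e300);
  for (std::size_t i = 0; i < points.size(); i += D)
    for (int d = 0; d < D; ++d) {
      lo[d] = std::min(lo[d], points[i + d]);
      hi[d] = std::max(hi[d], points[i + d]);
    }
  MPI_Allreduce(MPI_IN_PLACE, lo.data(), D, MPI_DOUBLE, MPI_MIN, comm);
  MPI_Allreduce(MPI_IN_PLACE, hi.data(), D, MPI_DOUBLE, MPI_MAX, comm);

  // Agree on a split rule from the global bound and exchange points so that
  // the lower half of the ranks keeps the left part (details elided):
  // SplitRule rule = ChooseSplitRule(lo, hi);
  // ExchangePoints(points, rule, comm);

  // Split the communicator and recurse on each half independently.
  MPI_Comm half;
  MPI_Comm_split(comm, rank < size / 2 ? 0 : 1, rank, &half);
  BuildGlobalTree(points, D, half);
  MPI_Comm_free(&half);
}
```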

Figure 4. Distributed memory parallelism in building the global tree (the first log p levels of the entire tree). Each solid arrow indicates a data exchange between two given processes. After exchanges on each level, the MPI communicator is split (shown as a dashed arrow) and the construction subsequently works in parallel. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Building the local tree. Here we closely follow the approach in ref. 32. The first few levels of the tree are built in a breadth-first manner with the assigned number of OpenMP threads proportional to the number of points participating in a reduction to form the bounding primitive (Fig. 5). The number of participating OpenMP threads per task halves as we descend each level. Each independent task with only one assigned OpenMP thread proceeds with the construction in a depth-first manner. We utilized the nested loop parallelization feature in OpenMP for this part.
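The sketch below illustrates the same top-down shared-memory build using OpenMP tasks rather than the nested-loop scheme described above; the Node layout, the midpoint split, and the leaf size are illustrative placeholders.

```cpp
#include <omp.h>
#include <cstddef>

struct Node {
  std::size_t begin, count;   // contiguous range of local points
  Node* left;
  Node* right;
};

// Builds the local tree: large subtrees spawn OpenMP tasks (parallel fan-out
// near the top), small subtrees fall back to plain depth-first recursion.
Node* BuildLocalTree(double* data, std::size_t begin, std::size_t count,
                     int D, std::size_t leaf_size) {
  Node* node = new Node{begin, count, nullptr, nullptr};
  if (count <= leaf_size) return node;            // leaf: store points directly

  std::size_t mid = begin + count / 2;            // placeholder split (e.g., median)
  // ComputeBoundAndPartition(data, begin, count, D, &mid);  // elided

  const bool spawn = count > 8 * leaf_size;       // only parallelize large subtrees
  #pragma omp task shared(data) if(spawn)
  node->left = BuildLocalTree(data, begin, mid - begin, D, leaf_size);
  #pragma omp task shared(data) if(spawn)
  node->right = BuildLocalTree(data, mid, begin + count - mid, D, leaf_size);
  #pragma omp taskwait
  return node;
}

// Usage: launch from a single thread inside a parallel region, e.g.
//   #pragma omp parallel
//   #pragma omp single
//   root = BuildLocalTree(data, 0, n, D, 40);
```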

Figure 5. Shared-memory parallelism in building the local tree for each MPI process. The first top levels are built in a breadth-first manner with the number of threads proportional to the amount of performed reduction. Any task with one assigned thread proceeds in a depth-first manner. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Overall runtime complexity. An all-reduce operation on the hypercube topology takes 𝒪(t_s log p + t_w m(p − 1)), where t_s, t_w, and m are the latency constant, the bandwidth constant, and the message size respectively. Assume that each process starts with the same number of points N/p, that each split on a global/local level equidistributes the points, and that only distributed-memory parallelism is used (i.e., p_thread = 1). Let m_bound be the message size of the bounding primitive divided by D. The overall runtime for each MPI process consists of:

  • The reduction cost and the split cost at each level 0 ≤ i < log p: 𝒪(DN/p).

  • The all-reduce cost on each level 0 ≤ i < log p: 𝒪(t_w m_bound D (p/2^i − 1)).

  • The total latency cost at each level 0 ≤ i < log p: 𝒪(t_s log(p/2^i)).

  • The base case at the level log p (the depth-first build of the local tree): 𝒪(D (N/p) log(N/p)).

Therefore, the overall complexity is 𝒪(D (N/p) log N + t_w m_bound D p + t_s log² p). This implies that the number of data points must grow as N log N ∼ 𝒪(p²) to achieve the same level of parallel efficiency. Note that the last (communication) terms have zero contribution if p = 1.
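One way to see the N log N ∼ 𝒪(p²) requirement is to write the parallel efficiency of the build explicitly; this is a sketch under the equidistribution assumption above, with c_1 and c_2 standing in for the constants hidden in the bullets.

```latex
T_p \;\approx\; c_1\,\frac{DN}{p}\log N \;+\; c_2\, t_w m_{\mathrm{bound}} D\, p,
\qquad
E \;=\; \frac{c_1 D N \log N}{p\,T_p}
   \;=\; \frac{1}{1 + \dfrac{c_2\, t_w m_{\mathrm{bound}}\, p^{2}}{c_1\, N \log N}}
```

Holding the efficiency E fixed as p grows therefore requires N log N to grow like p², matching the statement above.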

4. OVERALL ALGORITHM

Algorithm 4 shows the overall algorithm. Initially, each MPI process initializes its distributed task queue by dividing its own local query subtree into a set of T query grain subtrees, where T exceeds the number of threads p_thread running on each MPI process; initially, each of these grain subtrees has no tasks. The tree walker object maintains a stack of pairs of Q and R_P nodes that must be considered. It is first initialized with the following tuple: the root node of Q, the root node of the local reference tree R_P, and the probability guarantee α; the relative error tolerance ε and the absolute error tolerance τ are global constants. Threads not involved with the tree walk or with exchanging data can dequeue tasks from the local task queue.


4.1. Walking the Trees

Each MPI process takes the root node of the global query tree (the left tree in Fig. 6) and the root node of its local reference tree (the right tree) and performs a dual-tree recursion (Algorithm 5). For simplicity, we show the case where the reference side is descended before the query side. Any of the running threads can walk by dequeuing from the stack of frontier node pairs, generating local tasks, and queueing up reference subtrees to send to other processes. The expansion can be prioritized using the Heuristic function, which takes a pair of query/reference nodes. The walking procedure could also be extended to include the fancier expansion patterns described in ref. 39.
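A hedged sketch of such a walk is shown below; the node interface, the approximation test, and the two queues are placeholders standing in for the corresponding pieces of Algorithm 5, not its exact form.

```cpp
#include <stack>
#include <utility>

// Generic stack-based dual-tree walk: prune when the pair can be approximated,
// emit a direct-sum task for local leaf pairs, queue remote query work for
// later shipment, and otherwise expand the reference side before the query
// side, prioritized by a heuristic score (smaller = more promising).
template <typename Tree, typename TryApprox, typename Heuristic,
          typename TaskQueue, typename SendQueue>
void Walk(Tree* q_root, Tree* r_root, TryApprox try_approx,
          Heuristic heuristic, TaskQueue& tasks, SendQueue& sends) {
  std::stack<std::pair<Tree*, Tree*>> frontier;
  frontier.push({q_root, r_root});
  while (!frontier.empty()) {
    auto [q, r] = frontier.top();
    frontier.pop();
    if (try_approx(q, r)) continue;               // pruned within tolerance
    if (q->IsLeaf() && r->IsLeaf()) {
      if (q->IsLocal()) tasks.Push(q, r);         // direct sum on this process
      else              sends.Queue(r, q);        // ship R_sub to q's owner
      continue;
    }
    if (!r->IsLeaf()) {                           // reference side first
      Tree* near_child = r->Left();
      Tree* far_child = r->Right();
      if (heuristic(q, far_child) < heuristic(q, near_child))
        std::swap(near_child, far_child);
      frontier.push({q, far_child});              // expanded later
      frontier.push({q, near_child});             // expanded next
    } else {                                      // then the query side
      frontier.push({q->Left(), r});
      frontier.push({q->Right(), r});
    }
  }
}
```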

Figure 6. The global query tree is divided into a set of query subtrees, each of which can queue up a set of reference subsets to compute (shown vertically below each query subtree). The kernel summations for each query subtree can proceed in parallel. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

4.2. Message Passing

Inspired by the local essential trees approach, we develop a message passing system that utilizes the recursive doubling scheme. We assume that the master thread is the only thread that may initiate MPI calls. The key differences from the vanilla local essential tree approach are twofold: (i) our framework can support computations that have dynamic work requirements, unlike FMM; and (ii) our framework does not require each MPI process to accommodate all of the nonlocal data in its essential tree. Algorithm 6 shows the message passing routine called by the master thread on each MPI process. Any message between a pair of processes in a hypercube topology needs at most log p rounds of routing. At each stage i, the process P with binary representation P = (b_{log p−1},…,b_{i+1},0,b_{i−1},…,b_0)_2 sends messages to process P_neighbor = (b_{log p−1},…,b_{i+1},1,b_{i−1},…,b_0)_2 (and vice versa); a sketch of this exchange pattern appears after the list below. The following types of messages are exchanged between a pair of processes:


1. Reference subtrees: each MPI process sends out a reference subtree with the tag (Rsub, {Qsub}), where {Qsub} is the list of remote query subtrees that need Rsub.

2. Work-complete message: whenever a thread finishes computing a task (Qsub, Rsub), it queues up a pair of the completed work quantity and the list of all MPI ranks excluding itself. The message has the form (|Qsub||Rsub|, {0,…,P − 1, P + 1,…,p − 1}).

3. Extra tasks: one of the paired MPI processes can donate some of its tasks to the other (Section 4.3). This has the form (Qsub, {Rsub}), where {Rsub} is a list of reference subsets that must be computed for Qsub.

4. Imported query subtree flushes: during load balancing, query subtrees with several reference tasks may be imported from another process. These must be synchronized with the original query subtree on its originating process before tasks associated with it are dequeued.

5. The current load: the load is defined as the sum of |Qsub||Rsub| over the tasks associated with all query subtrees (both native and imported) on a given process.
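The sketch below shows the bare recursive-doubling exchange pattern used for these messages; serialization into a byte buffer and the routing of each message to its final destination are elided, and the buffer handling shown is illustrative only.

```cpp
#include <mpi.h>
#include <vector>

// One sweep of recursive doubling on p = 2^k processes: at stage i each
// process exchanges its current outbox with the rank differing in bit i,
// so any message can reach any destination within log p stages.
void RecursiveDoublingSweep(std::vector<char>& outbox, MPI_Comm comm) {
  int rank, p;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &p);
  for (int bit = 1; bit < p; bit <<= 1) {
    const int partner = rank ^ bit;             // hypercube neighbor at this stage
    int send_count = static_cast<int>(outbox.size());
    int recv_count = 0;
    MPI_Sendrecv(&send_count, 1, MPI_INT, partner, 0,
                 &recv_count, 1, MPI_INT, partner, 0,
                 comm, MPI_STATUS_IGNORE);
    std::vector<char> inbox(recv_count);
    MPI_Sendrecv(outbox.data(), send_count, MPI_CHAR, partner, 1,
                 inbox.data(), recv_count, MPI_CHAR, partner, 1,
                 comm, MPI_STATUS_IGNORE);
    // Absorb messages addressed to this rank and forward the rest in the
    // next stage; the simple append below stands in for that bookkeeping.
    outbox.insert(outbox.end(), inbox.begin(), inbox.end());
  }
}
```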

Distributed termination detection. We follow an idea similar to the one discussed in Section 14.7.4 of ref. 40. Initially, all MPI processes collectively have |Q||R| amount of work to complete. Each thread dequeues a task and completes a portion of its assigned local work (Fig. 6); the completed work quantity is then broadcast to all the other processes using the recursive doubling message passing. The total of completed and uncompleted work is conserved at any point in time. When every process believes that all |Q||R| work has been completed and it has sent out all of its queued-up work-complete messages, it can safely terminate.

4.3. Load Balancing

Our framework employs both static load balancing and dynamic load balancing.

Static load balancing. Each MPI task is initially in charge of computing the kernel sums for all of its grain query subtrees. This approach is similar to the ORB approach where the distributed tree determines the task distribution.

Dynamic load balancing. It is likely that the initial query subtree assignments will cause imbalance among processes. During the computation, we allow each query task to migrate from its current process P to a neighboring process P_neighbor. We use a very simple scheme in which the two processes paired up during each stage of the repeated recursive doubling attempt to balance their loads. Each process keeps sending out a snapshot of its computational load within the recursive doubling scheme and maintains a table of the estimated remaining computation on the other processes. Because these load estimates can be outdated by the time a given process considers transferring extra tasks, we employ a simple heuristic for initiating the load balance between a pair of imbalanced processes: if the estimated load on process P_neighbor is below β_threshold (0 < β_threshold < 1) times the current load on process P, transfer a 0.5(1 − β_threshold) fraction of the tasks from P to P_neighbor.
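The transfer decision itself reduces to a few lines; this sketch assumes tasks are held in a simple deque and that the load is approximated by the task count (the framework's actual load measure is the |Qsub||Rsub| sum defined above).

```cpp
#include <cstddef>
#include <deque>

// If the partner's estimated load is below beta_threshold times our own,
// move a 0.5 * (1 - beta_threshold) fraction of our queued tasks to it.
template <typename Task>
std::deque<Task> SelectTasksToDonate(std::deque<Task>& my_queue,
                                     double my_load, double partner_load,
                                     double beta_threshold) {
  std::deque<Task> donation;
  if (partner_load >= beta_threshold * my_load) return donation;  // balanced enough
  const std::size_t n_transfer = static_cast<std::size_t>(
      0.5 * (1.0 - beta_threshold) * static_cast<double>(my_queue.size()));
  for (std::size_t i = 0; i < n_transfer && !my_queue.empty(); ++i) {
    donation.push_back(my_queue.back());   // donate from the back of the queue
    my_queue.pop_back();
  }
  return donation;
}
```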


5. EXPERIMENTAL RESULTS

We developed our code base in C++ as part of MLPACK [41] and utilized open-source libraries such as the Boost libraries [42], the Armadillo linear algebra library [43], and the GNU Scientific Library [44]. We tested on the Hopper cluster at NERSC. Each node on the Hopper cluster has 24 cores; we used the recommended setting of 6 OpenMP threads per MPI task (p_thread = 6) with a maximum of 4 MPI tasks per node, and compiled using the GNU C++ compiler version 4.6.1 with the −O3 optimization flag. The configuration details are available at the website in ref. 45.


We chose to evaluate the scalability of our framework in the context of computing kernel density estimates [1]. We used the Epanechnikov kernel k(q, r) ∝ max{0, 1 − ‖q − r‖²/h²}, since it is the asymptotically optimal kernel. For the first part of our experiments, we considered uniformly distributed data points in the ten-dimensional hypercube [0,1]^10, since nonparametric methods such as KDE and NWR require an exorbitant number of samples in the uniform-distribution case. Applying nonparametric methods to higher dimensional datasets requires exploiting correlations between dimensions [46]. For the second part, we measured the strong scalability of our implementation on the SDSS dataset. All timings are the maximum across all processes.
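For concreteness, the kernel used here can be written as a small functor (up to its dimension-dependent normalizing constant, which cancels in relative-error comparisons and is therefore omitted in this sketch):

```cpp
#include <cstddef>

// Epanechnikov kernel with bandwidth h: k(q, r) = max(0, 1 - ||q - r||^2 / h^2).
// Its compact support is what enables aggressive pruning in the tree code.
inline double EpanechnikovKernel(const double* q, const double* r,
                                 std::size_t D, double h) {
  double dist2 = 0.0;
  for (std::size_t d = 0; d < D; ++d) {
    const double diff = q[d] - r[d];
    dist2 += diff * diff;
  }
  const double u2 = dist2 / (h * h);
  return u2 < 1.0 ? 1.0 - u2 : 0.0;
}
```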

5.1. Scalability of Distributed Tree Building

We compared the strong scalability of building the two main tree structures, kd-trees and metric trees, on a uniformly distributed ten-dimensional dataset containing 20 029 440 points (Fig. 8). In all cases, building a metric tree is more expensive than building a kd-tree; the reduction operation in Algorithm 3 for metric trees involves distance computations, whereas the reduction operator for kd-trees only computes coordinate-wise minima/maxima. For the weak-scaling result (shown in Fig. 9), we added 166 912 ten-dimensional data points per core, up to 1 025 507 328 points. Our analysis in Section 3 showed that the exact distributed tree building algorithm requires the number of data points to grow as N log N ∼ 𝒪(p²) to maintain parallel efficiency, and this is reflected in our experimental results.

However, readers should note that: (i) the trees built in our setting are much deeper than those in other papers [30], as each leaf in our tree contains 40 points; (ii) the tree building is empirically fast: on 6144 cores, we were able to build a kd-tree on over one billion ten-dimensional data points in under 30 s; and (iii) the one-time cost of building the distributed tree can be amortized over many queries.

Liu et al. [33] took a simple map-reduce approach to building a multidimensional binary tree (hybrid spill-trees, specifically). We conjecture that this approach may be faster for building but results in slower query times owing to suboptimal partitions. Future experiments will reveal its strengths and weaknesses.

5.2. Scalability of Kernel Summation

In this experiment, we measure the scalability of the overall kernel summation. Our algorithm has three main parts: building the distributed tree (Algorithm 1), walking the tree to generate the tasks (Algorithm 5, Fig. 7), and performing reductions on the generated tasks (Fig. 6). The kernel summation algorithm tested here employs only the deterministic approximations [6,7]. We used ε = 0.1, τ = 0, and α = 1 (see Definition 3).

Figure 7. Illustration of the tree walk performed by the 0-th MPI process in a group of 4 MPI processes. Iteration 0: starting with the global query tree root and the root node of the local reference tree owned by the 0-th MPI process; Iterations 1–2: descend the reference side before expanding the query side; Iteration 3: the reference subtree 12 is pruned for the 0-th and 1st MPI processes; Iterations 6–7: the reference subtree 12 is hashed to the list of subtrees to be considered for the query subtrees 8 and 9 (owned by the 2nd MPI process); Iteration 8: the reference subtree 12 is pruned for the 3rd MPI process. Iteration 9: the reference subtree 13 is considered subsequently after the reference subtree 12. At this point, the hashed reference subtree list includes (12, {8, 9}). [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Weak scaling (Fig. 10). We measured the weak scalability of all phases of the computation (the distributed tree building, the tree walk, and the computation). The data distribution we consider is a set of uniformly distributed ten-dimensional points. We vary the number of cores from 96 to 6144, adding 166 912 points per core. We used ε = 0.1 and decreased the bandwidth parameter h as more cores were added to keep the number of distance computations per core constant; a similar experimental setup was used in ref. 47, though we plan to perform more thorough evaluations. The computation maintains around 60% parallel efficiency above 96 cores.

Figure 8. Strong scaling result for distributed kd-tree building on a uniform point distribution in the ten-dimensional unit hypercube [0,1]10. The dataset has 20 029 440 points. The base timings for 6 cores are 105 and 52.9 s for the metric-tree and the kd-tree respectively. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Figure 9. Weak scaling result for distributed kd-tree building on a uniform point distribution in ten dimensions. We used 166 912 points per core. The base timing for 6 cores is 2.81 s. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Figure 10. Weak scaling result for the overall kernel summation computation on a uniform point distribution in ten dimensions. We used 166 912 points per core and ε = 0.1, halving the bandwidth h for every fourfold increase in the number of cores. The base timings for 6 cores are: 2.84 s for tree building, 1.8 s for the tree walk, and 128 s for the computation. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Strong scaling. Figure 11 presents strong scaling results on a 10-million-point, four-dimensional subset of the SDSS dataset. We used the Epanechnikov kernel with h = 0.000030518 (chosen by the plug-in rule) and ε = 0.1.

Figure 11. Strong scaling result for the overall kernel summation computation on the 10-million-point subset of SDSS Data Release 6. The base timings for 24 cores are: 13.5, 340, and 2370 s for tree building, tree walk, and computation respectively. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

6. POTENTIAL GPGPU EXTENSION AND ANALYSIS

Graphics processing units (GPUs) were originally designed for massively parallel and computationally intensive graphics applications. With recent developments in GPU programming models (e.g., CUDA [48], OpenCL), support for double-precision floating point computation, and an attractive dollar/flop ratio, it is becoming increasingly common to use GPUs for general-purpose computationally intensive tasks (GPGPU). Today, three of the ten fastest supercomputers make use of GPU acceleration [49]. In this section, we explore the possibility of GPU-based acceleration of kernel summation computations.

GPUs are made up of an array of streaming multiprocessors (SMs). Each multiprocessor consists of a number of simple cores called streaming processors (SPs); the exact number of SPs per multiprocessor depends on the device. Each SP in a single SM executes the same instruction (on different data) or performs no operation. The CUDA programming model allows us to write a serial program called a kernel. The computation is partitioned into threads, blocks of threads, and a grid of thread blocks. Each thread block consists of a fixed number of threads, and each SM is responsible for scheduling and executing a thread block. A thread block is executed by a single SM in groups of threads called warps; the current warp size for Nvidia devices is 32 threads. Each kernel is invoked with a specified grid of thread blocks and a specified number of threads per block.

In a CPU-only tree-based kernel summation algorithm, leaf computations (direct sums) account for a significant part of the overall computation time. Because these operations are embarrassingly parallel, they become an attractive choice for GPU acceleration.

Our GPU-based direct summation routine consists of two GPU kernels. The first kernel, KCompute, computes the elements of the kernel matrix K for a given pair of query subset Qsub and reference subset Rsub; each thread computes one element of K. Since each qi ∈ Qsub and rj ∈ Rsub is accessed multiple times during the computation, the kernel caches subsets of Qsub and Rsub in the multiprocessor's shared memory, which reduces global memory traffic. The second kernel, Reduction, performs a reduction over the rows of K to obtain the final direct sum for each qi ∈ Qsub. Note that the KCompute kernel is compute-bound while the Reduction kernel is memory-bound.
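A hedged sketch of a KCompute-style kernel is shown below; this is not the authors' code, and the Gaussian kernel, the row-major layout, and the 16 × 16 tile size are assumptions made for illustration. Each thread computes one entry K[i][j] = w_j k(q_i, r_j), and the block stages slabs of Qsub and Rsub coordinates through shared memory to cut global-memory traffic, as described above.

```cuda
#include <cuda_runtime.h>
#include <math.h>

#define TILE 16

// K[i * nr + j] = W[j] * exp(-||q_i - r_j||^2 / (2 h^2)) for q_i in Qsub,
// r_j in Rsub; Q is nq x D and R is nr x D, both row-major.
__global__ void KCompute(const double* Q, const double* R, const double* W,
                         double* K, int nq, int nr, int D, double h) {
  const int i = blockIdx.y * TILE + threadIdx.y;   // query index
  const int j = blockIdx.x * TILE + threadIdx.x;   // reference index

  __shared__ double q_tile[TILE][TILE];            // [query][dimension slab]
  __shared__ double r_tile[TILE][TILE];            // [dimension slab][reference]

  double dist2 = 0.0;
  for (int d0 = 0; d0 < D; d0 += TILE) {
    // Cooperatively stage a TILE-wide slab of coordinates into shared memory.
    const int dq = d0 + threadIdx.x;
    const int dr = d0 + threadIdx.y;
    q_tile[threadIdx.y][threadIdx.x] = (i < nq && dq < D) ? Q[i * D + dq] : 0.0;
    r_tile[threadIdx.y][threadIdx.x] = (j < nr && dr < D) ? R[j * D + dr] : 0.0;
    __syncthreads();

    const int dmax = min(TILE, D - d0);
    for (int d = 0; d < dmax; ++d) {
      const double diff = q_tile[threadIdx.y][d] - r_tile[d][threadIdx.x];
      dist2 += diff * diff;
    }
    __syncthreads();
  }

  if (i < nq && j < nr) K[i * nr + j] = W[j] * exp(-dist2 / (2.0 * h * h));
}

// Launch configuration: one thread per (query, reference) pair, e.g.
//   dim3 block(TILE, TILE);
//   dim3 grid((nr + TILE - 1) / TILE, (nq + TILE - 1) / TILE);
//   KCompute<<<grid, block>>>(dQ, dR, dW, dK, nq, nr, D, h);
```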

We ran our GPU code on an Nvidia M2090 'Fermi' GPU. The M2090 has 16 SMs, each with 32 SPs, for a total of 512 SPs. It has a memory bandwidth of 150 GB/s and a peak double-precision performance of 665 GFlop/s. The compute node consisted of two quad-core Xeon X5570 processors running at 2.93 GHz, with 8 MB of L2 cache and a memory bandwidth of 32 GB/s.

Initially, we ran experiments to compare the performance of a stand-alone GPU-based direct summation kernel with a sequential CPU-based direct summation routine. Panel (a) of Fig. 12 compares the two when the dimensionality is varied while the query set and the reference set are held fixed at 4096 points each. Similarly, panel (b) of Fig. 12 compares the two when the dimensionality is fixed at 1024 and the number of query/reference points is varied. While the CPU routine dominates the GPU kernel before the cross-over point at 512 points, the GPU kernel achieves a higher flop rate after the cross-over point. For both plots, the optimal leaf size for the GPU kernel was determined for a given number of points (for (a)) or a given dimension (for (b)). Integrating the GPU kernel into the distributed framework is work in progress.

We now estimate the potential speedup from our GPU kernels if they were to replace the CPU direct evaluation routine. For the moment, we focus on a shared-memory setting with no distributed parallelism. For a given kernel and a fixed bandwidth, we first measure the total elapsed time (time spent building the tree, performing direct evaluations, and traversing the tree) for the multi-threaded tree code. Since the optimal leaf sizes for the CPU and the GPU differ, we vary the leaf size and compare the best case for each. We vary the dimensionality while fixing the number of points at 16 384. We used the Gaussian kernel with a bandwidth of 2.0 and data points uniformly generated from [−1,1]^D. See Fig. 13.

Figure 12. GFlops achieved for the GPU direct evaluation kernel and the CPU direct evaluation routine. (a) Fixing the number of query points and the number of reference points at 4096 and varying the dimensionality. (b) Fixing the dimensionality at 1024 and varying the number of query/reference points. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Figure 13. For 16 384 query/reference points; x-axis plots the dimensionality while the y-axis plots the elapsed time in seconds. For each dimension, the left plots the best time achieved by the pure CPU-based tree-code while the right plots the best time achieved by the hypothetical CPU/GPU hybrid code. cpu overhead denotes the total time spent in building and performing the tree traversals. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

7. CONCLUSION

In this paper, we proposed a hybrid MPI/OpenMP kernel summation framework for scaling many popular data analysis methods. Our approach has several advantages: (i) a platform-independent C++ code base that utilizes standard protocols such as MPI and OpenMP; (ii) a templated code structure that can use any multidimensional binary tree and any approximation scheme that may be suitable for high-dimensional problems; and (iii) extensibility to a large class of problems that require fast evaluations of kernel sums. Our future work will address: (i) distributed computation on unreliable network connections; (ii) taking advantage of heterogeneous architectures, including GPUs, in a hybrid MPI/OpenMP/CUDA framework; and (iii) extension of the parallel engine to handle problems with more than pairwise interactions, such as the computation of n-point correlation functions [16,50].

Acknowledgements

This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

REFERENCES

  • 1
    E. Parzen, On estimation of a probability density function and mode, Ann Math Stat 33(3) (1962), 1065–1076.
  • 2
    E. Nadaraya, On estimating regression, Theory Prob Appl 9 (1964), 141–142.
  • 3
    C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning), The MIT Press, Cambridge, MA, 2005.
  • 4
    B. Scholkopf, A. Smola, and K. Muller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput 10(5) (1998), 1299–1319.
  • 5
    B. Scholkopf and A. Smola, Learning with Kernels: support vector machines, regularization, Optimization, and Beyond, Vol. 1, MIT Press, Cambridge, MA, 2002, 2.
  • 6
    D. Lee, A. Gray, and A. Moore, Dual-tree fast gauss transforms, In Advances in Neural Information Processing Systems, Vol. 18, Y. Weiss, B. Schölkopf, and J. Platt, eds. MIT Press, Cambridge, MA, 2006, 747–754.
  • 7
    D. Lee and A. Gray, Faster Gaussian summation: theory and experiment. Proceedings of the Twenty-second Conference on Uncertainty in Artificial Intelligence, 2006.
  • 8
    M. Holmes, A. Gray, and C. Isbell Jr, Ultrafast Monte Carlo for kernel estimators and generalized statistical summations, Adv Neural Inf Process Syst 21 (2008).
  • 9
    D. Lee and A. Gray, Fast high-dimensional kernel summations using the monte carlo multipole method, Adv Neural Inf Process Syst 21 (2009), 929–936.
  • 10
    A. Rahimi and B. Recht, Random features for large-scale kernel machines, Adv Neural Inf Process Syst 20, 2008, 1177–1184.
  • 11
    J. L. Bentley, Multidimensional binary search trees used for associative searching, Commun ACM 18 (1975), 509–517.
  • 12
    S. M. Omohundro, Five Balltree Construction Algorithms. Technical Report TR-89-063, International Computer Science Institute, 1989.
  • 13
    P. Yianilos, Data structures and algorithms for nearest neighbor search in general metric spaces, Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, 1993, 311–321.
  • 14
    S. Dasgupta and Y. Freund, Random projection trees and low dimensional manifolds, Proceedings of the 40th Annual ACM Symposium on Theory of Computing, ACM, British Columbia, Canada, 2008, 537–546.
  • 15
    P. Liu and J. Jan Wu, A framework for parallel tree-based scientific simulations, In Proceedings of the 26th International Conference on Parallel Processing, 1997, 137–144.
  • 16
    A. Gray and A. W. Moore, N-body problems in statistical learning, In Advances in Neural Information Processing Systems, T. K. Leen, T. G. Dietterich, and V. Tresp, eds. 2000, MIT Press, Cambridge, MA, 2001.
  • 17
    D. York, J. Adelman, J. Anderson Jr, S. Anderson, J. Annis, N. Bahcall, J. Bakken, R. Barkhouser, S. Bastian, E. Berman , et al, The sloan digital sky survey: Technical summary, Astron J 120 (2000), 1579.
  • 18
    L. Greengard and J. Strain, The fast gauss transform, SIAM J Sci Stat Comput 12(1) (1991), 79–94.
  • 19
    C. Yang, R. Duraiswami, N. A. Gumerov, and L. Davis, Improved fast gauss transform and efficient kernel density estimation. International Conference on Computer Vision, 2003.
  • 20
    A. Beygelzimer, S. Kakade, and J. Langford, Cover trees for nearest neighbor, Proceedings of the 23rd International Conference on Machine Learning, New York, ACM, 2006, 97–104.
  • 21
    A. Smola and B. Scholkopf, Sparse greedy matrix approximation for machine learning, In Proceedings of the Seventeenth International Conference on Machine Learning, 2000, 911–918.
  • 22
    A. Smola and P. Bartlett, Sparse greedy Gaussian process regression, Adv Neural Inf Process Syst 13 (2001), 619–625.
  • 23
    M. Ouimet and Y. Bengio, Greedy spectral embedding, Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, Citeseer, Barbados, West Indies, 2005, 253–260.
  • 24
    M. Seeger, C. Williams, N. Lawrence, and S. Dp, Fast forward selection to speed up sparse Gaussian process regression, In Workshop on AI and Statistics 9, 2003.
  • 25
    A. Appel, An efficient program for many-body simulation, SIAM J Sci Stat Comput 6 (1985), 85.
  • 26
    J. Barnes and P. Hut, A Hierarchical O(N log N) force-calculation algorithm, Nature 324 (1986), 446–449.
  • 27
    L. Greengard and V. Rokhlin, A fast algorithm for particle simulations, J Comput Phys 73 (1987), 325–348.
  • 28
    P. Callahan and S. Kosaraju, A decomposition of multidimensional point sets with applications to k-nearest-neighbors and n-body potential fields, JACM 42(1) (1995), 67–90.
  • 29
    L. Ying, G. Biros, and D. Zorin, A kernel-independent adaptive fast multipole algorithm in two and three dimensions, J Comput Phys 196(2) (2004), 591–626.
  • 30
    I. Lashuk, A. Chandramowlishwaran, H. Langston, T. Nguyen, R. Sampath, A. Shringarpure, R. Vuduc, L. Ying, D. Zorin, and G. Biros, A massively parallel adaptive fast-multipole method on heterogeneous architectures, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ACM, Portland, OR, 2009, 1–12.
  • 31
    I. Al-Furajh, S. Aluru, S. Goil, and S. Ranka, Parallel construction of multidimensional binary search trees, IEEE Transactions on Parallel and Distributed Systems 11(2) (2002), 136–148.
  • 32
    B. Choi, R. Komuravelli, V. Lu, H. Sung, R. Bocchino, S. Adve, and J. Hart, Parallel SAH kD tree construction, Proceedings of the Conference on High Performance Graphics, Eurographics Association, Saarbrücken, Germany, 2010, 77–86.
  • 33
    T. Liu, C. Rosenberg, and H. Rowley, Clustering billions of images with large scale nearest neighbor search, Proceedings of the Eighth IEEE Workshop on Applications of Computer Vision, IEEE Computer Society, Austin, TX, 2007, 28.
  • 34
    J. Singh, C. Holt, T. Totsuka, A. Gupta, and J. Hennessy, Load balancing and data locality in adaptive hierarchical n-body methods: Barnes-hut, fast multipole, and radiosity, J. of Parallel & Distributed Comp 27(2) (1995), 118–141.
  • 35
    F. Cruz, M. Knepley, and L. Barba, PetFMM - a dynamically load-balancing parallel fast multipole library, arXiv preprint arXiv:0905.2637 (2009), 403–428.
  • 36
    P. Loh, W. Hsu, C. Wentong, and N. Sriskanthan, How network topology affects dynamic loading balancing, Parallel & Distributed Tech, IEEE 4(3) (1996), 25–35.
  • 37
    J. K. Salmon, Parallel Hierarchical N-body Methods. Ph.D Thesis, California Institute of Technology, 1990.
  • 38
    J. Singh, J. Hennessy, and A. Gupta, Implications of hierarchical n-body methods for multiprocessor architectures, ACM Trans Comput Syst 13(2) (1995), 141–202.
  • 39
    R. Riegel, A. Gray, and G. Richards, Massive-scale kernel discriminant analysis: mining for quasars, SIAM International Conference on Data Mining, Citeseer, Atlanta, GA, 2008.
  • 40
    P. Pacheco, Parallel Programming with MPI, Morgan Kaufmann, Burlington, MA, 1997.
  • 41
    G. Boyer, R. Riegel, N. Vasiloglou, D. Lee, L. Poorman, C. Mappus, N. Mehta, H. Ouyang, P. Ram, L. Tran, W. C. Wong, and A. Gray, MLPACK. http://mloss.org/software/view/152, Accessed on September, 2011 and October, 2011, 2009.
  • 42
    S. Koranne, Boost C++ Libraries, Handbook of Open Source Tools, Springer, New York, 2011, 127–143.
  • 43
    C. Sanderson, Armadillo: An Open Source C++ Linear Algebra Library for Fast Prototyping and Computationally Intensive Experiments, NICTA, Australia, Technical Report, 2010.
  • 44
    G. P. Contributors, GSL - GNU scientific library - GNU project - free software foundation (FSF). http://www.gnu.org/software/gsl/, Accessed on September, 2011 and October, 2011, 2010.
  • 45
    NERSC Computational Systems. http://www.nersc.gov/users/computational-systems/ Accessed on September, 2011 and October, 2011.
  • 46
    A. Ozakin and A. Gray, Submanifold density estimation, Advances in Neural Information Processing Systems 22 (2009), 1375–1382.
  • 47
    R. Sampath, H. Sundar, and S. Veerapaneni, Parallel fast gauss transform, In Supercomputing, 2010.
  • 48
    http://www.nvidia.com/object/cuda_home_new.html. Accessed on September, 2011 and October, 2011.
  • 49
    http://www.top500.org. Accessed on September, 2011 and October, 2011.
  • 50
    A. Moore, A. Connolly, C. Genovese, A. Gray, L. Grone, N. Kanidoris, R. Nichol, J. Schneider, A. Szalay, I. Szapudi, and L. Wasserman, Fast algorithms and efficient statistics: N-point correlation functions, In Proceedings of MPA/MPE/ESO Conference Mining the Sky, Garching, Germany, 2000.