“**Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods**”

Nathan Bell, Steven Dalton, and Luke Olson*SIAM Journal on Scientific Computing (accepted)*

Algebraic multigrid methods for large, sparse linear systems are a necessity in many computational simulations, yet parallel algorithms for such solvers are generally decomposed into coarse-grained tasks suitable for distributed computers with traditional processing cores. However, accelerating multigrid on massively parallel throughput-oriented processors, such as the GPU, demandsalgorithms with abundant fine-grained parallelism. In this paper, we develop a parallel algebraic multigrid method which exposes substantial fine-grained parallelism in both the construction of the multigrid hierarchy as well as the cycling or solve stage. Our algorithms are expressed in terms of scalable parallel primitives that are efficiently implemented on the GPU. The resulting solver achieves an average speedup of 1.8× in the setup phase and 5.7× in the cycling phase when compared to a representative CPU implementation.

“**PyDEC: Software and Algorithms for Discretization of Exterior Calculus**”

Nathan Bell and Anil Hirani*ACM Transactions on Mathematical Software (accepted)*

This paper describes the algorithms, features and implementation of PyDEC, a Python library for computations related to the discretization of exterior calculus. PyDEC facilitates inquiry into both physical problems on manifolds as well as purely topological problems on abstract complexes. We describe efficient algorithms for constructing the operators and objects that arise in discrete exterior calculus, lowest order finite element exterior calculus and in related topological problems. Our algorithms are formulated in terms of high-level matrix operations which extend to arbitrary dimension. As a result, our implementations map well to the facilities of numerical libraries such as NumPy and SciPy. The availability of such libraries makes Python suitable for prototyping numerical methods. We demonstrate how PyDEC is used to solve physical and topological problems through several concise examples.

“**Thrust: A Productivity-Oriented Library for CUDA**”

Nathan Bell and Jared Hoberock*GPU Computing Gems, Jade Edition, Edited by Wen-mei W. Hwu, October 2011*

This chapter demonstrates how to leverage the Thrust parallel template library to implement high-performance applications with minimal programming effort. Based on the C++ Standard Template Library (STL), Thrust brings a familiar high-level interface to the realm of GPU Computing while remaining fully interoperable with the rest of the CUDA software ecosystem. Applications written with Thrust are concise, readable, and efficient.

“**Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods**”

Nathan Bell, Steven Dalton, and Luke Olson*NVIDIA Technical Report NVR-2011-002, June 2011*

Algebraic multigrid methods for large, sparse linear systems are a necessity in many computational simulations, yet parallel algorithms for such solvers are generally decomposed into coarse-grained tasks suitable for distributed computers with traditional processing cores. However, accelerating multigrid on massively parallel throughput-oriented processors, such as the GPU, demands algorithms with abundant fine-grained parallelism. In this paper, we develop a parallel algebraic multigrid method which exposes substantial fine-grained parallelism in both the construction of the multigrid hierarchy as well as the cycling or solve stage. Our algorithms are expressed in terms of scalable parallel primitives that are efficiently implemented on the GPU. The resulting solver achieves an average speedup of over 2x in the setup phase and around 6x in the cycling phase when compared to a representative CPU implementation.

“**Sparse Matrix-Vector Multiplication on Multicore and Accelerators**”

Sam Williams, Nathan Bell, Jee Whan Choi, Michael Garland, Leonid Oliker, and Richard Vuduc*Scientific Computing on Multicore and Accelerators, December 2010*

This chapter summarizes recent work on the development of highperformance multicore and accelerator-based implementations of sparse matrix-vector multiplication (SpMV). As an object of study, SpMV is an interesting computation for at least two reasons. First, it appears widely in applications in scientific and engineering computing, nancial and economic modeling, and information retrieval, among others, and is therefore of great practical interest. Secondly, it is both simple to describe but challenging to implement well, since its performance is limited by a variety of factors, including low computational intensity, potentially highly irregular memory access behavior, and a strong input-dependence that be known only at run-time. Thus, we believe SpMV both practically important as well as insightful for understanding the algorithmic and implementation principles necessary to making eff ective use of state-of-the-art systems.

“**Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors**”

Nathan Bell and Michael Garland*Proc. Supercomputing ‘09, November 2009*

Sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra. In contrast to the uniform regularity of dense linear algebra, sparse operations encounter a broad spectrum of matrices ranging from the regular to the highly irregular. Harnessing the tremendous potential of throughput-oriented processors for sparse operations requires that we expose substantial fine-grained parallelism and impose sufficient regularity on execution paths and memory access patterns. We explore SpMV methods that are well-suited to throughput-oriented architectures like the GPU and which exploit several common sparsity classes. The techniques we propose are efficient, successfully utilizing large percentages of peak bandwidth. Furthermore, they deliver excellent total throughput, averaging 16 GFLOP/s and 10 GFLOP/s in double precision for structured grid and unstructured mesh matrices, respectively, on a GeForce GTX 285. This is roughly 2.8 times the throughput previously achieved on Cell BE and more than 10 times that of a quad-core Intel Clovertown system.

“**Efficient Sparse Matrix-Vector Multiplication on CUDA**”

Nathan Bell and Michael Garland*NVIDIA Technical Report NVR-2008-004, December 2008*

The massive parallelism of graphics processing units (GPUs) offers tremendous performance in many high-performance computing applications. While dense linear algebra readily maps to such platforms, harnessing this potential for sparse matrix computations presents additional challenges. Given its role in iterative methods for solving sparse linear systems and eigenvalue problems, sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra.

In this paper we discuss data structures and algorithms for SpMV that are efficiently implemented on the CUDA platform for the fine-grained parallel architecture of the GPU. Given the memory-bound nature of SpMV, we emphasize memory bandwidth efficiency and compact storage formats. We consider a broad spectrum of sparse matrices, from those that are well-structured and regular to highly irregular matrices with large imbalances in the distribution of nonzeros per matrix row. We develop methods to exploit several common forms of matrix structure while offering alternatives which accommodate greater irregularity.

On structured, grid-based matrices we achieve performance of 36 GFLOP/s in single precision and 16 GFLOP/s in double precision on a GeForce GTX 280 GPU. For unstructured finite-element matrices, we observe performance in excess of 15 GFLOP/s and 10 GFLOP/s in single and double precision respectively. These results compare favorably to prior state-of-the-art studies of SpMV methods on conventional multicore processors. Our double precision SpMV performance is generally two and a half times that of a Cell BE with 8 SPEs and more than ten times greater than that of a quad-core Intel Clovertown system.

“**Algebraic Multigrid for Discrete Differential Forms**”

Nathan Bell*PhD Thesis (UIUCDCS-R-2008-2986), August 2008*

Discrete differential forms arise in scientific disciplines ranging from computational electromagnetics to computer graphics. Examples include stable discretizations of the eddy-current problem, topological methods for sensor network coverage, visualization of complex flows, surface parameterization, and the design of vector fields on meshes. In this thesis we describe efficient and scalable numerical solvers for discrete k-form problems. Our approach is based on the principles of algebraic multigrid (AMG) which is designed to solve large-scale linear systems with optimal, or near-optimal efficiency. Since the k-form problems to be solved are arbitrarily large, the need for scalable numerical solvers is clear.

“**Algebraic Multigrid for k-form Laplacians**”

Nathan Bell and Luke N. Olson*Numerical Linear Algebra with Applications, Volume 15, Issue 2-3, Pages 165-185, February 2008*

In this paper we describe an aggregation-based algebraic multigrid method for the solution of discrete k-form Laplacians. Our work generalizes Reitzinger and Schöberl’s algorithm to higher-dimensional discrete forms. We provide conditions on the tentative prolongators under which the commutativity of the coarse and fine de Rham complexes is maintained. Further, a practical algorithm that satisfies these conditions is outlined, and smoothed prolongation operators and the associated finite element spaces are highlighted. Numerical evidence of the efficiency and generality of the proposed method is presented in the context of discrete Hodge decompositions.

“**A Fast Multigrid Algorithm for Mesh Deformation**”

Lin Shi, Yizhou Yu, Nathan Bell, and Wei-Wen Feng*SIGGRAPH 2006 (ACM Transactions on Graphics, Vol. 24, No. 3)*

In this paper, we present a multigrid technique for efficiently deforming large surface and volume meshes. We show that a previous least-squares formulation for distortion minimization reduces to a Laplacian system on a general graph structure for which we derive an analytic expression. We then describe an efficient multigrid algorithm for solving the relevant equations. Here we develop novel prolongation and restriction operators used in the multigrid cycles. Combined with a simple but effective graph coarsening strategy, our algorithm can outperform other multigrid solvers and the factorization stage of direct solvers in both time and memory costs for large meshes. It is demonstrated that our solver can trade off accuracy for speed to achieve greater interactivity, which is attractive for manipulating large meshes. Our multigrid solver is particularly well suited for a mesh editing environment which does not permit extensive precomputation. Experimental evidence of these advantages is provided on a number of meshes with a wide range of size. With our mesh deformation solver, we also successfully demonstrate that visually appealing mesh animations can be generated from both motion capture data and a single base mesh even when they are inconsistent.

“**Particle-Based Simulation of Granular Materials**”

Nathan Bell, Yizhou Yu, and Peter J. Mucha*ACM SIGGRAPH/Eurographics Symposium on Computer Animation 2005*

Granular materials, such as sand and grains, are ubiquitous. Simulating the 3D dynamic motion of such materials represents a challenging problem in graphics because of their unique physical properties. In this paper we present a simple and effective method for granular material simulation. By incorporating techniques from physical models, our approach describes granular phenomena more faithfully than previous methods. Granular material is represented by a large collection of non-spherical particles which may be in persistent contact. The particles represent discrete elements of the simulated material. One major advantage of using discrete elements is that the topology of particle interaction can evolve freely. As a result, highly dynamic phenomena, such as splashing and avalanches, can be conveniently generated by this meshless approach without sacrificing physical accuracy. We generalize this discrete model to rigid bodies by distributing particles over their surfaces. In this way, two-way coupling between granular materials and rigid bodies is achieved.