  • 1
Electronic Resource
    Springer
The Journal of Supercomputing 10 (1996), pp. 169-189
    ISSN: 1573-0484
    Keywords: NUMA architecture ; parallel programming models ; shared memory ; parallel programming environments ; distributed arrays ; global arrays ; one-sided communication ; scientific computing ; Grand Challenges ; computational chemistry
    Source: Springer Online Journal Archives 1860-2000
    Topics: Computer Science
    Notes: Abstract Portability, efficiency, and ease of coding are all important considerations in choosing the programming model for a scalable parallel application. The message-passing programming model is widely used because of its portability, yet some applications are too complex to code in it while also trying to maintain a balanced computation load and avoid redundant computations. The shared-memory programming model simplifies coding, but it is not portable and often provides little control over interprocessor data transfer costs. This paper describes an approach, called Global Arrays (GAs), that combines the better features of both other models, leading to both simple coding and efficient execution. The key concept of GAs is that they provide a portable interface through which each process in a MIMD parallel program can asynchronously access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes. We have implemented the GA library on a variety of computer systems, including the Intel Delta and Paragon, the IBM SP-1 and SP-2 (all message passers), the Kendall Square Research KSR-1/2 and the Convex SPP-1200 (nonuniform access shared-memory machines), the CRAY T3D (a globally addressable distributed-memory computer), and networks of UNIX workstations. We discuss the design and implementation of these libraries, report their performance, illustrate the use of GAs in the context of computational chemistry applications, and describe the use of a GA performance visualization tool.
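(A conceptual sketch of the one-sided access pattern described here follows the record list.)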
    Type of Medium: Electronic Resource
  • 2
Electronic Resource
    Springer
Theoretical Chemistry Accounts 84 (1993), pp. 457-473
    ISSN: 1432-2234
    Keywords: Eigensolving ; Massively parallel computers ; Small dense matrices
    Source: Springer Online Journal Archives 1860-2000
    Topics: Chemistry and Pharmacology
Notes: Summary Eigensolving (diagonalizing) small dense matrices threatens to become a bottleneck in the application of massively parallel computers to electronic structure methods. Because the computational cost of electronic structure methods typically scales as O(N³) or worse, even teraflop computer systems with thousands of processors will often confront problems with N ≲ 10,000. At present, diagonalizing an N×N matrix on P processors is not efficient when P is large compared to N. The loss of efficiency can make diagonalization a bottleneck on a massively parallel computer, even though it is typically a minor operation on conventional serial machines. This situation motivates a search for both improved methods and identification of the computer characteristics that would be most productive to improve. In this paper, we compare the performance of several parallel and serial methods for solving dense real symmetric eigensystems on a distributed-memory message-passing parallel computer. We focus on matrices of size N = 200 and processor counts P = 1 to P = 512, with execution on the Intel Touchstone DELTA computer. The best eigensolver method is found to depend on the number of available processors. Of the methods tested, a recently developed Blocked Factored Jacobi (BFJ) method is the slowest for small P, but the fastest for large P. Its speed is a complicated non-monotonic function of the number of processors used. A detailed performance analysis of the BFJ method shows that: (1) the factor most responsible for limited speedup is communication startup cost; (2) with current communication costs, the maximum achievable parallel speedup is modest (one order of magnitude) compared to the best serial method; and (3) the fastest solution is often achieved by using fewer than the maximum number of available processors.
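(A minimal serial Jacobi sketch, for contrast with the parallel methods compared here, follows the record list.)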
    Type of Medium: Electronic Resource
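The Global Arrays abstract in record 1 rests on one idea: any process may fetch a block of a physically distributed matrix without the owner's explicit cooperation. The GA library itself exposes a Fortran/C interface not shown in the record, so the sketch below illustrates the same one-sided access pattern using standard MPI remote memory access via mpi4py; the block layout and buffer names are assumptions made for the illustration, not the GA API.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()

# Each process owns one contiguous block of a logically global 1-D array
# (GA generalizes this to logical blocks of distributed matrices).
block = 4
local = np.full(block, rank, dtype='d')
win = MPI.Win.Create(local, comm=comm)   # expose the block for RMA

# One-sided get: read a neighbor's block with no matching send/receive
# on the owner's side, which is the essence of the model the paper describes.
target = (rank + 1) % nprocs
buf = np.empty(block, dtype='d')
win.Lock(target, MPI.LOCK_SHARED)
win.Get(buf, target)
win.Unlock(target)

print(f"rank {rank} fetched block of rank {target}: {buf}")
win.Free()
```

Run with, e.g., `mpiexec -n 4 python ga_sketch.py`; each rank reads its neighbor's data asynchronously, without the neighbor posting a receive.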
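Record 2's abstract compares eigensolvers without defining the Jacobi family the BFJ method builds on. For reference, here is a minimal serial cyclic Jacobi sweep; the paper's Blocked Factored Jacobi method additionally blocks and distributes these rotations across processors, which this sketch does not attempt. The function name and tolerances are illustrative assumptions.

```python
import numpy as np

def jacobi_eig(A, tol=1e-12, max_sweeps=50):
    """Serial cyclic Jacobi: apply 2x2 rotations until the
    off-diagonal part is negligible. Returns (eigenvalues, eigenvectors)."""
    A = np.array(A, dtype=float)   # work on a copy
    n = A.shape[0]
    V = np.eye(n)
    for _ in range(max_sweeps):
        off = np.sqrt(np.sum(np.tril(A, -1) ** 2))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p, q]) < 1e-30:
                    continue
                # Rotation angle that zeroes A[p, q]:
                # tan(2*theta) = 2*A[p,q] / (A[q,q] - A[p,p])
                theta = 0.5 * np.arctan2(2 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J     # similarity transform
                V = V @ J           # accumulate eigenvectors
    return np.diag(A), V

# Usage: check against LAPACK on a random symmetric matrix.
M = np.random.rand(6, 6)
M = (M + M.T) / 2
w, V = jacobi_eig(M)
print(np.allclose(np.sort(w), np.linalg.eigvalsh(M)))
```

Jacobi methods are attractive on parallel machines because disjoint (p, q) rotation pairs can be applied concurrently; the abstract's finding is that communication startup cost, not arithmetic, limits how far that concurrency pays off.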