ZIB

1

Porting a Legacy CUDA Stencil Code to oneAPI (2020)

Christgau, Steffen (Dr.) ; Steinke, Thomas (Dr.)

Publication Date: 2020-10-16

Description: Recently, Intel released the oneAPI programming environment. With Data Parallel C++ (DPC++), oneAPI enables codes to target multiple hardware architectures such as multi-core CPUs, GPUs, and even FPGAs or other hardware from a single source. For legacy codes that were written for Nvidia GPUs, a compatibility tool is provided which facilitates the transition to the SYCL-based DPC++ programming language. This paper presents early experiences with both the compatibility tool and oneAPI, as well as the employed extension to the SYCL programming standard, for the tsunami simulation code easyWave. A performance study compares the original code running on Xeon processors using OpenMP as well as CUDA with the performance of the DPC++ counterpart on multi-core CPUs as well as integrated GPUs.

Language: English

Type: conferenceobject , doc-type:conferenceObject

2

Leveraging a Heterogeneous Memory System for a Legacy Fortran Code: The Interplay of Storage Class Memory, DRAM and OS (2020)

Christgau, Steffen (Dr.) ; Steinke, Thomas (Dr.)

Publication Date: 2022-06-13

Description: Large-capacity Storage Class Memory (SCM) opens new possibilities for workloads requiring a large memory footprint. We examine optimization strategies for a legacy Fortran application on systems with a heterogeneous memory configuration comprising SCM and DRAM. We present a performance study for the multigrid solver component of the large-eddy simulation framework PALM for different memory configurations with large-capacity SCM. An important optimization approach is the explicit assignment of storage locations depending on the data access characteristics to take advantage of the heterogeneous memory configuration. We demonstrate that explicit control over memory locations provides better performance than transparent hardware settings. Because page management by the OS appears to be a critical performance factor on such systems, we also study the impact of different huge page settings.

Language: English

Type: conferenceobject , doc-type:conferenceObject

3

The HighPerMeshes Framework for Numerical Algorithms on Unstructured Grids (2021)

Alhaddad, Samer ; Förstner, Jens (Prof. Dr.) ; Groth, Stefan ; [et al.]

Publication Date: 2022-12-05

Description: Solving PDEs on unstructured grids is a cornerstone of engineering and scientific computing. Heterogeneous parallel platforms, including CPUs, GPUs, and FPGAs, enable energy-efficient and computationally demanding simulations. In this article, we introduce the HPM C++-embedded DSL that bridges the abstraction gap between the mathematical formulation of mesh-based algorithms for PDE problems on the one hand and an increasing number of heterogeneous platforms with their different programming models on the other hand. Thus, the HPM DSL aims at higher productivity in the code development process for multiple target platforms. We introduce the concepts as well as the basic structure of the HPM DSL, and demonstrate its usage with three examples. The mapping of the abstract algorithmic description onto parallel hardware, including distributed memory compute clusters, is presented. A code generator and a matching back end allow the acceleration of HPM code with GPUs. Finally, the achievable performance and scalability are demonstrated for different example problems.

Language: English

Type: article , doc-type:article

4

Multi-threaded Kernel Offloading to GPGPU Using Hyper-Q on Kepler Architecture (2014)

Wende, Florian ; Steinke, Thomas (Dr.) ; Cordes, Frank (Dr.)

Publication Date: 2022-12-12

Description: Small-scale computations usually cannot fully utilize the compute capabilities of modern GPGPUs. With the Fermi GPU architecture, Nvidia introduced the concurrent kernel execution feature, which allows up to 16 GPU kernels to execute simultaneously on a shared GPU device for better utilization of its resources. Insufficient scheduling capabilities in this respect, however, can significantly reduce the theoretical concurrency level. With the Kepler GPU architecture, Nvidia addresses this issue by introducing the Hyper-Q feature with 32 hardware-managed work queues for concurrent kernel execution. We investigate the Hyper-Q feature within heterogeneous workloads in which multiple concurrent host threads or processes each offload computations to the GPU. By means of a synthetic benchmark kernel and a hybrid parallel CPU-GPU real-world application, we evaluate the performance obtained with Hyper-Q on the GPU and compare it against a kernel reordering mechanism introduced by the authors for the Fermi architecture.

Language: English

Type: reportzib , doc-type:preprint

Format: application/pdf

5

An Early Scalability Study of Omni-Path Express (2022)

Brook, Glenn ; Fuller, Douglas ; Swinburne, John ; [et al.]

Publication Date: 2023-01-09

Description: This work provides a brief description of Omni-Path Express and the current status of its development, stability, and performance. Basic benchmarks that highlight the gains of OPX over PSM2 are provided, and the results of an initial performance and scalability study of several applications are presented.

Language: English

Type: conferenceobject , doc-type:conferenceObject

6

A First Step towards Support for MPI Partitioned Communication on SYCL-programmed FPGAs (2022)

Christgau, Steffen (Dr.) ; Knaust, Marius ; Steinke, Thomas (Dr.)

Publication Date: 2023-07-17

Description: Version 4.0 of the Message Passing Interface standard introduced the concept of Partitioned Communication, which adds support for multiple contributions to a communication buffer. Although initially targeted at multithreaded MPI applications, Partitioned Communication is currently attracting attention in the context of accelerators, especially GPUs. This publication demonstrates that this communication concept can also be implemented for SYCL-programmed FPGAs. This includes a discussion of the design space and the presentation of a prototypical implementation. Experimental results show that a lightweight implementation on top of an existing MPI library is possible. In addition, the presented approach reveals issues in both the SYCL and the MPI standard which need to be addressed for improved support of the intended communication style.

Language: English

Type: conferenceobject , doc-type:conferenceObject

7

Efficient adaptivity for simulating cardiac electrophysiology with spectral deferred correction methods (2022)

Chegini, Fatemeh (Dr.) ; Steinke, Thomas (Dr.) ; Weiser, Martin (Dr.)

Publication Date: 2023-12-18

Description: The locality of solution features in cardiac electrophysiology simulations calls for adaptive methods. Due to the overhead incurred by established mesh refinement and coarsening, however, such approaches have failed to accelerate the computations. Here we investigate a different route to spatial adaptivity that is based on nested subset selection for algebraic degrees of freedom in spectral deferred correction methods. This combination of algebraic adaptivity and iterative solvers for higher-order collocation time stepping realizes a multirate integration with minimal overhead. This leads to moderate but significant speedups in both monodomain and cell-by-cell models of cardiac excitation, as demonstrated in four numerical examples.

Language: English

Type: conferenceobject , doc-type:conferenceObject
