Tag Archives: CUDA

Strided Memory Access on CPUs, GPUs, and MIC

Optimization guides for GPUs discuss at length the importance of contiguous ("coalesced", etc.) memory access for achieving high memory bandwidth (e.g. this parallel4all blog post). But how does strided memory access compare across different architectures? Is this something specific to NVIDIA GPUs? Let's shed some light on these questions with some benchmarks.

GPU Memory Bandwidth vs. Thread Blocks (CUDA) / Workgroups (OpenCL)

The massive parallelism of GPUs provides ample performance for certain algorithms in scientific computing. At the same time, however, Amdahl's Law imposes limits on possible performance gains from parallelization. Thus, let us look in this blog post at how *few* threads one can launch on GPUs while still getting good performance (here: memory bandwidth).

GPU Research Center at TU Wien

Today it was announced that TU Wien hosts an NVIDIA GPU Research Center, for which Josef Weinbub, Florian Rudolf, and I are PIs. The agenda includes improvements to ViennaCL as well as PETSc, both open source libraries I'm actively involved in. In addition to continued, incremental improvements, we will also look into two interesting research questions related to the numerical solution of partial differential equations.

Karl Rupp

Computational Scientist
