China Trumps Top500 with Sunway TaihuLight

The June 2016 update of the Top500 brought a new leader: the Sunway TaihuLight at the National Supercomputing Center in Wuxi, China. Given that Tianhe-2 had been leading the Top500 for three years, a new leader was overdue. Let us have a closer look at a couple of interesting details of Sunway TaihuLight. Continue reading →

FWF-Project: 3D Solution of the Boltzmann Equation on Supercomputers

The Austrian Science Fund (FWF) approved my project proposal entitled "3D Solution of the Boltzmann Equation on Supercomputers". This project will fund my scientific work for three more years, with a prospective start in mid-2017. Here is a brief summary of what this project is about. Continue reading →

Three Suggestions for Improving OpenCL for Library Developers

OpenCL is not (yet) a success story in high performance computing. More researchers are drawn towards NVIDIA's CUDA, which offers a richer toolchain and is easier to get started with. Vendor lock-in seems to be less of a concern for my colleagues, although, as somebody who is paid from public money, I do not agree.

Anyway, this blog post is not yet-another-OpenCL-vs-CUDA discussion. Instead, it provides three suggestions on how OpenCL could become more attractive for software library developers to grow the OpenCL library ecosystem. Only if OpenCL libraries provide 90+ percent of the functionality a user needs will the user be willing to spend the time on getting the remaining percent (if any) done. Continue reading →

Latency Comparison of Lua, OpenCL, and native C/C++

Just-in-time compilation is an appealing technique for producing optimized code at run time rather than at compile time. In an earlier post I looked into the just-in-time compilation overhead of various OpenCL SDKs. This blog post looks into the cost of launching OpenCL kernels on the CPU and compares it with the cost of calling a plain C/C++ function through a function pointer, and with the cost of calling a precompiled Lua script. Continue reading →

Multi-Threading in C/C++: Implications on Software Library Design

With the increase in parallelism in response to a stagnation of clock frequencies, software libraries are pushed towards multi-threading. However, there are several different threading approaches out there: The most popular in the C/C++ world are POSIX Threads (pthread), OpenMP, and C++11 threads. Clearly, a good software library does not enforce the use of one particular approach, but is able to deal with (almost) any multi-threading approach. In this blog post I will discuss a possible software library design to achieve this. Continue reading →

Raspberry Pi: Interfacing Honeywell Humidity and Temperature Sensors

Recently I was toying with a Raspberry Pi 2 and other hardware to get a better idea about the current status of the Internet of Things. Among several sensors, I was also looking into a Honeywell HIH8131 sensor (around 25 Euros, obtained from Reichelt). Unfortunately, none of the solutions I found on the web for reading the sensor worked for me, so I finally went down into the low-level details of communicating via the I2C bus through the Linux kernel. And I enjoyed it!
Continue reading →

GEMM and STREAM Results on Intel Edison

Intel Edison is a tiny computer (smaller than a credit card) targeted at the Internet of Things. Its CPU consists of two Silvermont Atom cores running at 500 MHz, and the device is offered for around 70 US dollars. Even though Intel Edison is not designed for high performance computing, its design goal of low power consumption nevertheless makes it interesting to look at from a high performance computing perspective. Let us have a closer look.
Continue reading →

Sparse Matrix Transposition: Datastructure Performance Comparison

While processor manufacturers repeatedly emphasize the importance of their latest innovations such as vector extensions (AVX, AVX2, etc.) of the processing elements, proper placement of data in memory is at least equally important. At the same time, generic implementations of many different data structures allow one to (re)use the most appealing one quickly. However, the intuitively most appropriate data structure may not be the fastest. Continue reading →

Strided Memory Access on CPUs, GPUs, and MIC

Optimization guides for GPUs discuss at length the importance of contiguous ("coalesced", etc.) memory access for achieving high memory bandwidth (e.g. this parallel4all blog post). But how does strided memory access compare across different architectures? Is this something specific to NVIDIA GPUs? Let's shed some light on these questions with some benchmarks. Continue reading →

Join the PETSc User Meeting 2016!

PETSc, the Portable, Extensible Toolkit for Scientific Computing, is one of the world's most widely used software libraries for high-performance computational science. With most of the PETSc core team employed at Argonne National Laboratory in Illinois, USA, exchange with the European user community has been hampered by geographic distance. This year, the PETSc team will reach out to Europe and hold the PETSc User Meeting 2016 on June 28-30 in Vienna, Austria. Continue reading →

Karl Rupp

Computational Scientist