HPTT: A High-Performance Tensor Transposition C++ Library

by Paul Springer, et al.

We recently presented TTC, a domain-specific compiler for tensor transpositions. Although the code it generates performs nearly optimally, TTC works offline, so it cannot be used in applications where the tensor sizes and the required permutations are known only at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Like TTC, HPTT incorporates optimizations such as blocking, multi-threading, and explicit vectorization; in addition, it decomposes any transposition into multiple loops around a so-called micro-kernel. This modular design, inspired by BLIS, makes HPTT easy to port to different architectures: only the hand-vectorized micro-kernel (e.g., a 4x4 transpose) has to be replaced. HPTT also offers an optional autotuning framework, guided by a performance model, that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of tensor transpositions and architectures (e.g., Intel Ivy Bridge, Intel Knights Landing, ARMv7, IBM Power7), HPTT attains a bandwidth comparable to that of SAXPY and yields remarkable speedups over Eigen's tensor transposition implementation. Most importantly, integrating HPTT into the Cyclops Tensor Framework (CTF) improves the overall performance of tensor contractions by up to 3.1x.
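The "loops around a micro-kernel" idea can be illustrated with a minimal sketch for the simplest case, a 2D row-major matrix. This is not HPTT's API or implementation: the names `kBlock`, `microKernel4x4`, and `transposeBlocked` are hypothetical, the kernel is plain scalar code rather than hand-vectorized, and the sketch assumes both dimensions are multiples of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// All names here are hypothetical and chosen for illustration only.
constexpr std::size_t kBlock = 4;

// Micro-kernel: transposes a single 4x4 tile.
// `in` and `out` point to the top-left element of the source and destination
// tiles; ldIn and ldOut are the row strides of the enclosing matrices.
// Porting to a new architecture would mean replacing only this function,
// for example with a version written using SIMD intrinsics.
inline void microKernel4x4(const float* in, std::size_t ldIn,
                           float* out, std::size_t ldOut) {
    for (std::size_t i = 0; i < kBlock; ++i) {
        for (std::size_t j = 0; j < kBlock; ++j) {
            out[j * ldOut + i] = in[i * ldIn + j];
        }
    }
}

// Outer loops: walk the matrix tile by tile and call the micro-kernel.
// `a` is a row-major rows x cols matrix; the result is row-major cols x rows.
// Assumption: rows and cols are multiples of kBlock (no remainder handling).
std::vector<float> transposeBlocked(const std::vector<float>& a,
                                    std::size_t rows, std::size_t cols) {
    assert(a.size() == rows * cols);
    assert(rows % kBlock == 0 && cols % kBlock == 0);

    std::vector<float> b(rows * cols);
    for (std::size_t i = 0; i < rows; i += kBlock) {
        for (std::size_t j = 0; j < cols; j += kBlock) {
            // Tile (i, j) of the input lands at tile (j, i) of the output.
            microKernel4x4(&a[i * cols + j], cols, &b[j * rows + i], rows);
        }
    }
    return b;
}
```

Calling `transposeBlocked(a, 8, 4)` on an 8x4 matrix returns the 4x8 transpose. A general tensor transposition would add more loops for the remaining dimensions and handle sizes that are not multiples of the tile, which this sketch leaves out.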




