Software for Sparse Tensor Decomposition on Emerging Computing Architectures

09/24/2018
by Eric Phipps, et al.

In this paper, we develop software for decomposing sparse tensors that is portable to and performant on a variety of multicore, manycore, and GPU computing architectures. The result is a single code whose performance matches optimized architecture-specific implementations. The key to a portable approach is to determine multiple levels of parallelism that can be mapped in different ways to different architectures, and we explain how to do this for the matricized tensor times Khatri-Rao product (MTTKRP), which is the key kernel in canonical polyadic tensor decomposition. Our implementation leverages the Kokkos framework, which enables a single code to achieve high performance across multiple architectures that differ in how they approach fine-grained parallelism. We also introduce a new construct for portable thread-local arrays, which we call compile-time polymorphic arrays. Not only are the specifics of our approaches and implementation interesting for tuning tensor computations, but they also provide a roadmap for developing other portable high-performance codes. As a last step in optimizing performance, we modify the MTTKRP algorithm itself to do a permuted traversal of tensor nonzeros to reduce atomic-write contention. We test the performance of our implementation on 16- and 68-core Intel CPUs and the NVIDIA K80 and P100 GPUs, showing that we are competitive with state-of-the-art architecture-specific codes while having the advantage of being able to run on a variety of architectures.
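
To make the core computation concrete, below is a minimal sketch (not the authors' actual implementation) of a mode-0 MTTKRP for a three-way sparse tensor in coordinate (COO) format, written with Kokkos. The view names (subs, vals, B, C, M) are illustrative. The sketch exposes only one level of parallelism, a parallel loop over nonzeros, and uses atomic adds to accumulate into the output factor matrix; the resulting atomic-write contention is what the permuted traversal of nonzeros described in the abstract is designed to reduce, and the paper's implementation additionally maps further levels of parallelism (such as over factor-matrix columns) to each architecture.

    #include <Kokkos_Core.hpp>

    // Hedged sketch: mode-0 MTTKRP, M(i,:) += x(i,j,k) * (B(j,:) .* C(k,:)),
    // parallelized over the nonzeros of a COO-format sparse tensor.
    // Different nonzeros that share a first index i update the same row of M,
    // hence the atomic adds.
    void mttkrp_mode0(const Kokkos::View<int**>    subs,  // nnz x 3 coordinates (assumed layout)
                      const Kokkos::View<double*>  vals,  // nnz values
                      const Kokkos::View<double**> B,     // J x R factor matrix
                      const Kokkos::View<double**> C,     // K x R factor matrix
                      const Kokkos::View<double**> M)     // I x R output, assumed pre-zeroed
    {
      const int R = static_cast<int>(M.extent(1));
      Kokkos::parallel_for("mttkrp_mode0", vals.extent(0), KOKKOS_LAMBDA(const int n) {
        const int    i = subs(n, 0);
        const int    j = subs(n, 1);
        const int    k = subs(n, 2);
        const double x = vals(n);
        for (int r = 0; r < R; ++r)
          Kokkos::atomic_add(&M(i, r), x * B(j, r) * C(k, r));
      });
    }

A caller would wrap this in Kokkos::initialize()/Kokkos::finalize() and allocate the views in the default execution space's memory, so that the same source can be compiled for, e.g., OpenMP threads on multicore and manycore CPUs or CUDA on GPUs.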

