Least Squares on GPUs in Multiple Double Precision

10/15/2021
by Jan Verschelde, et al.

This paper describes the application of code generated by the CAMPARY software to accelerate the solving of linear systems in the least squares sense on Graphics Processing Units (GPUs), in double double, quad double, and octo double precision. The goal is to use accelerators to offset the cost overhead caused by multiple double precision arithmetic. For the blocked Householder QR and the back substitution, the dimensions of interest are those at which teraflop performance is attained. A second question is the cost overhead factor incurred each time the precision is doubled. Experimental results are reported on five different NVIDIA GPUs, with a particular focus on the P100 and the V100, both capable of teraflop performance. Thanks to the high Compute to Global Memory Access (CGMA) ratios of multiple double arithmetic, teraflop performance is already attained when running the double double QR on 1,024-by-1,024 matrices, on both the P100 and the V100. For the back substitution, the dimension of the upper triangular system must be as high as 17,920 to reach one teraflop on the V100 in quad double precision, and then only when the times spent by the kernels alone are taken into account. The lower performance of the back substitution in small dimensions does not prevent teraflop performance of the solver at dimension 1,024, because the time for the QR decomposition dominates. In doubling the precision from double double to quad double, and from quad double to octo double, the observed cost overhead factors are lower than the factors predicted by the arithmetical operation counts. This observation correlates with the increased performance at increased precision, which can again be explained by the high CGMA ratios.
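To make the term "double double" concrete, here is a minimal sketch in Python of how a number can be stored as an unevaluated sum of two hardware doubles and how two such numbers can be added. It is an illustration only, not the CAMPARY-generated GPU code the abstract refers to. The function names (`two_sum`, `quick_two_sum`, `dd_add`, `dd_to_fraction`) are hypothetical, and the addition shown is the simple textbook variant built on Knuth's error-free sum, which is assumed here for brevity.

```python
from fractions import Fraction


def two_sum(a, b):
    """Error-free sum: return (s, e) with s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def quick_two_sum(a, b):
    """Same as two_sum, but cheaper; assumes |a| >= |b|."""
    s = a + b
    e = b - (s - a)
    return s, e


def dd_add(x, y):
    """Add two double doubles, each given as a pair (high, low) of floats.

    This is a simple variant that folds both low parts into the rounding
    error of the high parts and then renormalizes.
    """
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return quick_two_sum(s, e)


def dd_to_fraction(x):
    """Exact rational value of a double double, for checking results."""
    return Fraction(x[0]) + Fraction(x[1])


if __name__ == "__main__":
    tiny = 2.0 ** -60
    # In plain double precision the small term is lost entirely.
    print(1.0 + tiny == 1.0)
    # As a double double, the small term survives in the low part.
    print(dd_add((1.0, 0.0), (tiny, 0.0)))
```

In this sketch, one `dd_add` performs 11 double precision additions and subtractions on four input doubles, whereas a hardware double addition performs one operation on two inputs. That higher ratio of arithmetic to data moved is the kind of effect the abstract attributes to the high CGMA ratios of multiple double arithmetic.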
