Distributed-memory ℋ-matrix Algebra I: Data Distribution and Matrix-vector Multiplication
We introduce a data distribution scheme for ℋ-matrices and a distributed-memory algorithm for ℋ-matrix-vector multiplication. The data distribution scheme avoids the expensive Ω(P^2) scheduling procedure used in previous work, where P is the number of processes, while keeping the data well balanced. Building on this distribution, our distributed-memory algorithm divides all computations evenly among the P processes and adopts a novel tree-communication algorithm to reduce the latency cost. The overall complexity of our algorithm is O(N log N / P + α log P + β log^2 P) for ℋ-matrices under the weak admissibility condition, where N is the matrix size, α denotes the latency, and β denotes the inverse bandwidth. In numerical experiments, we apply the algorithm to two- and three-dimensional problems of various sizes on various numbers of processes. Good parallel efficiency is observed even on thousands of processes.
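To see how the three terms of the stated complexity trade off, the bound can be evaluated as a rough cost model. The sketch below is illustrative only and is not taken from the paper: the function name `modeled_cost`, the `flop_time` parameter, the choice of base-2 logarithms, and the assumption that every hidden constant equals 1 are all ours.

```python
import math


def modeled_cost(n, p, alpha, beta, flop_time=1.0):
    """Evaluate N log N / P + alpha log P + beta log^2 P.

    Assumptions (not from the paper): logarithms are base 2, every
    constant hidden by the O-notation is 1, and flop_time is a
    hypothetical per-operation cost in the same unit as alpha and beta.
    """
    if n < 1 or p < 1:
        raise ValueError("n and p must be positive")
    log_p = math.log2(p)
    compute = flop_time * n * math.log2(n) / p  # work shared by P processes
    latency = alpha * log_p                     # message start-up cost
    bandwidth = beta * log_p ** 2               # data transfer cost
    return compute + latency + bandwidth
```

With P = 1 both communication terms vanish and only the N log N term remains. As P grows for fixed N, the compute term shrinks like 1/P while the communication terms grow only polylogarithmically in P.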