- Category: Information Science and Technology, Sociology
- Topic: Communication
This thesis examines the CA-GMRES method, a communication-avoiding variant of the GMRES method developed by Hoemmen et al. in the PhD thesis "Communication avoiding Krylov Subspace Methods" [8], which centres on solving large, sparse systems of linear equations, both symmetric and nonsymmetric.
One of the main advantages of CA-GMRES is that it reduces communication by a factor of s, in both sequential and parallel settings, while maintaining the same rate of convergence as the standard GMRES implementation. A key ingredient is that the s-step basis length and the restart length can be chosen independently, which improves numerical stability and convergence and leads to notable performance gains.
The feasibility and efficacy of the CA-GMRES method were tested in a shared-memory environment and compared against a standard parallel GMRES implementation. Additionally, the ILU(0) preconditioner was integrated and its effects analysed, producing further numerical and performance results in support of Hoemmen et al.'s work.
Algorithms incur costs both from arithmetic computations and from communication, and the communication costs are by far the higher of the two. Constructing communication-avoiding algorithms mitigates this issue, which is becoming ever more pressing given the growing gap between CPU and memory performance and the increasingly parallel nature of computer hardware.
Communication here means the movement of data, between processors in the parallel case or between fast and slow levels of the memory hierarchy in the sequential case. Communication-avoiding algorithms aim to minimise this data movement, even though some communication always remains necessary. Avoiding communication, however, often requires alternative techniques and algorithms in order to sidestep numerical instabilities and to produce well-conditioned basis vectors.
Krylov subspace methods are iterative algorithms for solving large, sparse linear systems in real or complex arithmetic. Given the indispensable role of Krylov methods in scientific computing, it is natural to try to improve them by reducing their communication costs. Krylov methods differ in several respects, however, such as the length of their recurrences and the operations they require of the matrix; for an overview of these variations, see Hoemmen et al. [8].
The GMRES method, in particular, is a one-sided Krylov subspace method that uses long recurrences and can handle general matrices. The goal of the communication-avoiding CA-GMRES variant is to perform s steps for the same communication cost as a single step, which is essentially optimal.
The remainder of this thesis is structured as follows. Section 2 introduces GMRES and its fundamental principles. Section 3 presents the essential computational kernels and explains how communication avoidance is achieved. Section 4 covers the s-step Arnoldi process proposed by Hoemmen et al. [8]. Section 5 describes the CA-GMRES algorithm, explains how a preconditioner can be applied, and addresses implementation details. Sections 6 and 7 present the numerical and performance experiments, respectively, and Section 8 summarises the findings and draws conclusions.
The GMRES method is a powerful iterative Krylov subspace algorithm designed to solve large, sparse linear systems. It starts from an initial guess x0 and the initial residual r0 = b - Ax0. At each iteration k it computes a correction vector zk that solves the least-squares problem of minimizing ||b - A(x0 + zk)||2. The vector zk is sought in the Krylov subspace Kk(A, r0), which is spanned by the initial residual and its successive products with A.
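Stated compactly, this is a restatement of the minimization just described in the same notation:

```latex
x_k = x_0 + z_k, \qquad
z_k = \operatorname*{argmin}_{z \,\in\, \mathcal{K}_k(A,\, r_0)} \bigl\| b - A(x_0 + z) \bigr\|_2,
\qquad
\mathcal{K}_k(A, r_0) = \operatorname{span}\{ r_0,\, A r_0,\, \ldots,\, A^{k-1} r_0 \}.
```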
The solution at iteration k is then xk = x0 + zk. Since the vectors {r0, Ar0, ..., A^(k-1) r0} are usually ill-conditioned, the Arnoldi method is used to produce k+1 orthonormal basis vectors. These vectors are stored in the matrix Q = [q1, q2, ..., qk, qk+1], with the first vector q1 = r0/||r0||2. The method also produces a (k+1)-by-k upper Hessenberg matrix H such that A Qk = Q H, where Qk denotes the first k columns of Q. The correction vector is then zk = Qk y, where y solves the least-squares problem that minimizes the residual norm.
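For illustration, the following is a minimal sketch of the Arnoldi process just described, using modified Gram-Schmidt for the orthogonalization; the function and variable names are chosen for this example and are not taken from the thesis.

```python
import numpy as np

def arnoldi(A, r0, k):
    """Build k+1 orthonormal basis vectors of K_k(A, r0) and the
    (k+1)-by-k upper Hessenberg matrix H with A @ Q[:, :k] = Q @ H."""
    n = r0.shape[0]
    Q = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    Q[:, 0] = r0 / np.linalg.norm(r0)          # q1 = r0 / ||r0||_2
    for j in range(k):
        w = A @ Q[:, j]                        # expand the subspace
        for i in range(j + 1):                 # modified Gram-Schmidt
            H[i, j] = Q[:, i] @ w
            w -= H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] == 0.0:                 # happy breakdown: stop early
            return Q[:, :j + 1], H[:j + 1, :j + 1]
        Q[:, j + 1] = w / H[j + 1, j]
    return Q, H
```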
To factor H into the product of an orthogonal matrix G and an upper triangular matrix U, the method applies k Givens rotations to H. The right-hand side becomes g = β G^T e1, where e1 is the first column of the identity matrix and β = ||r0||2. The solution at iteration k is then obtained by solving the triangular system defined by U and g, so that yk = argmin_y ||g - U y||2 and xk = x0 + Qk yk. The norm of the residual at iteration k, ||b - A xk||2, indicates the algorithm's convergence.
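A matching sketch of the small least-squares solve: the Givens rotations reduce H to the triangular factor U while transforming β e1 into g, and a triangular solve then gives yk. This is again an illustrative sketch, not the author's implementation.

```python
import numpy as np

def solve_hessenberg_lsq(H, beta):
    """Solve min_y || beta*e1 - H @ y ||_2 for a (k+1)-by-k upper
    Hessenberg H by applying k Givens rotations (H = G @ U)."""
    kp1, k = H.shape
    U = H.astype(float).copy()
    g = np.zeros(kp1)
    g[0] = beta                               # g starts as beta * e1
    for j in range(k):
        a, b = U[j, j], U[j + 1, j]
        rho = np.hypot(a, b)
        c, s = a / rho, b / rho               # rotation that zeroes U[j+1, j]
        G_j = np.array([[c, s], [-s, c]])
        U[j:j + 2, j:] = G_j @ U[j:j + 2, j:]
        g[j:j + 2] = G_j @ g[j:j + 2]
    y = np.linalg.solve(U[:k, :k], g[:k])     # back substitution on U
    return y, abs(g[k])                       # y_k and the residual norm

# The iterate would then be recovered as x_k = x0 + Q[:, :k] @ y.
```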
In this thesis, computational kernels refer to the parts of an algorithm that dominate its cost, in terms of both arithmetic operations and communication. The essential building blocks of Arnoldi(s, t) and CA-GMRES, discussed in Sections 4 and 5 respectively, are the following computational kernels:
- Matrix powers kernel: This kernel takes as input an n-by-n matrix A and a starting vector v1, and produces the s further vectors Av1, A^2 v1, ..., A^s v1. In s-step methods, the matrix powers kernel (MPK) replaces the sparse matrix-vector products that generate the basis for the Krylov subspace Ks+1(A, v1). Since A is usually large and sparse, the graph of A, denoted G(A), plays an essential role, as known graph algorithms can be applied to it. The MPK is what enables communication avoidance: it computes all s vectors for roughly the same communication cost as a single sparse matrix-vector product.
In the parallel setting, the matrix powers kernel distributes the data and the workload among P processors in a way that reduces the number of messages sent. Each processor can then compute its local part of all the basis vectors without further communication, provided the required remote entries (the ghost zones) have been fetched beforehand. However, as the number of steps s increases, so do the amount of ghosted data and the redundant floating-point operations.
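The following is a minimal, sequential sketch of the monomial-basis matrix powers kernel; a communication-avoiding implementation would instead compute all the vectors locally from redundantly fetched (ghosted) rows of A, but the output is the same. Names are illustrative.

```python
import numpy as np
import scipy.sparse as sp

def matrix_powers(A, v1, s):
    """Return the n-by-(s+1) block [v1, A v1, A^2 v1, ..., A^s v1],
    i.e. the monomial basis of K_{s+1}(A, v1)."""
    V = np.zeros((v1.shape[0], s + 1))
    V[:, 0] = v1
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]    # one SpMV per additional basis vector
    return V

# Example: 1-D Poisson matrix and a random starting vector.
# A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(1000, 1000), format="csr")
# V = matrix_powers(A, np.random.default_rng(0).standard_normal(1000), 5)
```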
Preconditioning, which accelerates the convergence of Krylov methods, fundamentally changes the matrix powers kernel, so preconditioners that parallelize well are preferred in a communication-avoiding setting. For instance, Nuentsa et al. developed a parallel GMRES with a multiplicative Schwarz preconditioner, while Grigori et al. proposed CA-ILU(0), a preconditioner that at first sight seems ill-suited to communication avoidance but proved effective in a parallel, communication-avoiding environment.
TSQR is a QR decomposition algorithm for m-by-n matrices with m >> n, i.e. tall and skinny matrices. It uses a divide-and-conquer approach over a reduction tree, which can range from a purely sequential linear tree to a fully parallel binary tree. With P = 4 processors, for example, the matrix is split into four row blocks; the QR factorizations of the submatrices are combined pairwise up the tree until a single final QR factorization remains, as shown in equation (3.1) below. Parallel TSQR is highly advantageous thanks to its accuracy and speed, provided the tree structure is chosen appropriately for the matrix size and the underlying architecture.
Overall, these parallel kernels and preconditioners can significantly improve the efficiency of sparse linear solvers, especially where tall and skinny matrices are involved, for which conventional QR methods communicate far more.
For P = 4, the matrix A is split into four row blocks and factored as

```latex
A =
\begin{pmatrix} A_0 \\ A_1 \\ A_2 \\ A_3 \end{pmatrix}
=
\begin{pmatrix} Q_{00} & & & \\ & Q_{10} & & \\ & & Q_{20} & \\ & & & Q_{30} \end{pmatrix}
\begin{pmatrix} Q_{01} & \\ & Q_{11} \end{pmatrix}
Q_{02} \, R_{02}. \qquad (3.1)
```
As Figure 1 illustrates, this approach requires only O(log2 P) messages for P processors, fewer than Householder QR or MGS, and it also reduces the amount of data transferred between memory hierarchy levels. Equation (3.1) implicitly defines the Q factor, and how close it is to orthogonal depends on the factorization used in the local steps: with Gram-Schmidt-type factorizations the loss of orthogonality grows with A's condition number, whereas Householder QR is unconditionally stable. Householder QR is therefore the natural choice for the local QR factorizations in TSQR, which makes TSQR itself unconditionally stable.
In some cases, it is preferable for TSQR to produce an R factor with real nonnegative diagonal entries. In [3], Demmel et al. propose a numerically stable modification of the usual Householder QR that, when used within TSQR, yields such an R factor.
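As a sketch of the reduction just described, the following computes the R factor of TSQR with a binary tree over P row blocks, using NumPy's Householder-based QR for every local factorization; the block sizes and function names are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def tsqr_r_factor(blocks):
    """Binary-tree TSQR: `blocks` is a list of local row blocks A_i,
    each m_i-by-n with m_i >= n. Returns the n-by-n R factor of the
    stacked matrix [A_0; A_1; ...]."""
    # Leaf level: local Householder QR of every block, keep only R_i0.
    rs = [np.linalg.qr(Ai, mode="r") for Ai in blocks]
    # Reduction tree: pairwise stack-and-refactor until one R remains.
    while len(rs) > 1:
        rs = [np.linalg.qr(np.vstack(rs[i:i + 2]), mode="r")
              for i in range(0, len(rs), 2)]
    return rs[0]

# Example with P = 4 blocks of a 4000-by-10 matrix:
# rng = np.random.default_rng(1)
# A_blocks = [rng.standard_normal((1000, 10)) for _ in range(4)]
# R = tsqr_r_factor(A_blocks)
# R agrees with np.linalg.qr(np.vstack(A_blocks), mode="r") up to row signs.
```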
Block Gram-Schmidt
The Gram-Schmidt process generates an orthonormal basis from a set of s linearly independent basis vectors V. Blocked Gram-Schmidt algorithms work on blocks of columns instead of one column at a time, which substantially reduces the number of messages and the amount of data transferred between memory hierarchy levels.
In their performance analysis using a simplified parallel model, Hoemmen et al. [8] showed that blocked classical Gram-Schmidt outperforms blocked modified Gram-Schmidt, although the former is less numerically stable than the latter. Modified Gram-Schmidt is commonly employed to orthogonalize the basis in the Arnoldi iteration (Section 4). Nonetheless, Greenbaum et al. showed in [6] that, when solving linear systems with GMRES, orthogonality near machine precision is not as vital as the linear independence of the Arnoldi basis. The faster but less stable blocked classical Gram-Schmidt can therefore be used effectively in the CA-GMRES algorithm (Section 5).
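As a sketch under illustrative assumptions (example names, NumPy's QR standing in for TSQR in the intra-block step), one block step of classical Gram-Schmidt orthogonalizes a new block V against the already orthonormal columns Q with two matrix-matrix products and then factors the remainder:

```python
import numpy as np

def block_cgs_step(Q, V):
    """Orthogonalize the n-by-s block V against the orthonormal n-by-m
    block Q (classical Gram-Schmidt, block form), then orthonormalize
    V internally. Returns (Q_new, R_prev, R_new) so that
    V = Q @ R_prev + Q_new @ R_new."""
    R_prev = Q.T @ V                 # projection coefficients (one matmul)
    W = V - Q @ R_prev               # remove components along Q (one matmul)
    Q_new, R_new = np.linalg.qr(W)   # intra-block QR (TSQR in practice)
    return Q_new, R_prev, R_new
```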