Hao Xie


Ph.D. candidate at IOP, CAS


Automatic differentiation of truncated SVD

Truncated SVD (Singular Value Decomposition) is an important linear algebra operation with wide applications in various tensor network algorithms. In this article, we will study the (reverse-mode) automatic differentiation of truncated SVD. Due to the close relation between SVD and eigen-decomposition, it turns out that one faces similar difficulties when deriving the back-propagation formulas for truncated SVD and for the dominant eigensolver. In view of this, the presentation here closely parallels that of the dominant eigensolver in previous articles. In particular, the adjoint method and the approach based on full SVD are used to derive the desired results. These two approaches are completely equivalent and reveal the mechanism behind the scenes that effectively separates the information we want from the full spectrum.

$\renewcommand{\vec}[1]{\boldsymbol{#1}}$

Some mathematical backgrounds of SVD


The study of automatic differentiation of truncated SVD through the adjoint method requires some detailed mathematical understanding of SVD. In this section, we will give the necessary background in this regard.

Let $A$ be an arbitrary $M \times N$ real matrix, where $M$ is generally not equal to $N$. We will derive the SVD of $A$ in a constructive way. It turns out that the SVD of $A$ is intimately related to the eigen-decompositions of the square matrices $A A^T$ and $A^T A$, both of which are symmetric and positive semi-definite. This implies that the eigenvalues of both $A A^T$ and $A^T A$ are non-negative. Furthermore, their eigen-spectra are closely related, as can be seen from the following propositions:

Proposition 1: Let $A$ be an arbitrary $M \times N$ real matrix.

  1. If an $M$-dimensional vector $\vec{l}$ is an eigenvector of $A A^T$ with eigenvalue $\lambda > 0$, then $A^T \vec{l} \neq \vec{0}$ is an eigenvector of $A^T A$ with the same eigenvalue $\lambda$; if $\vec{l}$ is an eigenvector of $A A^T$ with eigenvalue zero, then $A^T \vec{l} = \vec{0}$.
  2. If an $N$-dimensional vector $\vec{r}$ is an eigenvector of $A^T A$ with eigenvalue $\lambda > 0$, then $A \vec{r} \neq \vec{0}$ is an eigenvector of $A A^T$ with the same eigenvalue $\lambda$; if $\vec{r}$ is an eigenvector of $A^T A$ with eigenvalue zero, then $A \vec{r} = \vec{0}$.

Proposition 2: Let $A$ be an arbitrary $M \times N$ real matrix.

  1. $A A^T$ and $A^T A$ have the same set of non-zero (positive) eigenvalues, including their multiplicities. Denoting these eigenvalues by $(\lambda_1, \cdots, \lambda_k)$, one can write

    where $k \leq \min(M, N)$ is the total number of nonzero eigenvalues (counted with multiplicity) and $\lambda_1 \geq \cdots \geq \lambda_k > 0$. $(\vec{l}_1, \cdots, \vec{l}_M)$ and $(\vec{r}_1, \cdots, \vec{r}_N)$ are complete sets of eigenvectors of $A A^T$ and $A^T A$, respectively.

  2. Furthermore, the various eigenvectors in 1 can be chosen to satisfy the following conditions:

The proofs are pretty easy and can be left as an exercise. Proposition 2 is a direct consequence of Proposition 1. Note that the square matrices $A A^T$ and $A^T A$ generally have different dimensions. Nevertheless, they have exactly the same set of nonzero eigenvalues, which is amazing. In the discussions below, we will continue to employ the notations of Proposition 2, and the relevant properties Eqs. $\eqref{eq: leigenvector}$-$\eqref{eq: from r to l}$ will also be used extensively.
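
As a quick sanity check of the two propositions, the following snippet (a NumPy sketch, not part of the original derivation; the matrix sizes and seed are arbitrary) verifies numerically that $A A^T$ and $A^T A$ of a random rectangular matrix share the same set of nonzero eigenvalues:

```python
import numpy as np

# Numerical check of Propositions 1 and 2: the nonzero eigenvalues of
# A A^T and A^T A coincide, including multiplicities.
np.random.seed(42)
M, N = 5, 3
A = np.random.randn(M, N)

lambdas_left = np.linalg.eigvalsh(A @ A.T)    # M eigenvalues, ascending order
lambdas_right = np.linalg.eigvalsh(A.T @ A)   # N eigenvalues, ascending order

# Keep only the (at most min(M, N)) nonzero eigenvalues and compare them.
tol = 1e-10
nonzero_left = np.sort(lambdas_left[lambdas_left > tol])[::-1]
nonzero_right = np.sort(lambdas_right[lambdas_right > tol])[::-1]
print(np.allclose(nonzero_left, nonzero_right))   # True
```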

From Eqs. $\eqref{eq: normalization}$-$\eqref{eq: from r to l}$, it is easy to derive that for $i = 1, \cdots, M$, $j = 1, \cdots, N$, we have

By introducing the following notations

one can write Eq. $\eqref{eq: lAr}$ in a more compact form as follows:

As indicated by the orthonormality condition $\eqref{eq: normalization}$, both $\tilde{U}$ and $\tilde{V}$ are orthogonal matrices. One can thus obtain from $\eqref{eq: UAV}$ that

$$A = U S V^T, \label{eq: SVD}$$

where we have introduced the matrices

$$U = (\vec{l}_1, \cdots, \vec{l}_k), \qquad V = (\vec{r}_1, \cdots, \vec{r}_k), \qquad S = \textrm{diag}\left(\sqrt{\lambda_1}, \cdots, \sqrt{\lambda_k}\right).$$

The columns of $U$ and $V$ consist of the eigenvectors with nonzero eigenvalues of $A A^T$ and $A^T A$, respectively. Eq. $\eqref{eq: SVD}$ is the celebrated SVD (Singular Value Decomposition) of an arbitrary rectangular matrix $A$. The diagonal elements of $S$, $\sqrt{\lambda_i} \equiv s_i > 0, \forall i = 1, \cdots, k$, are usually called the (nonzero) singular values of the matrix $A$, while $(\vec{l}_1, \cdots, \vec{l}_k)$ and $(\vec{r}_1, \cdots, \vec{r}_k)$ are the corresponding left singular vectors and right singular vectors, respectively.
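
The constructive derivation above can also be checked numerically. The following sketch (variable names and sizes are ad hoc) builds the singular triplets from the eigen-decomposition of $A^T A$ alone and verifies that the resulting $U$, $S$, $V$ indeed reproduce $A$:

```python
import numpy as np

# Constructive SVD: obtain the right singular vectors from A^T A, then the
# left singular vectors via l_i = A r_i / sqrt(lambda_i), and verify A = U S V^T.
np.random.seed(0)
M, N = 6, 4
A = np.random.randn(M, N)

lambdas, R = np.linalg.eigh(A.T @ A)          # eigenvalues in ascending order
order = np.argsort(lambdas)[::-1]             # sort descending
lambdas, R = lambdas[order], R[:, order]

keep = lambdas > 1e-12                        # nonzero part of the spectrum
s = np.sqrt(lambdas[keep])                    # singular values s_i = sqrt(lambda_i)
V = R[:, keep]                                # right singular vectors r_i
U = A @ V / s                                 # left singular vectors l_i = A r_i / s_i

print(np.allclose(U @ np.diag(s) @ V.T, A))   # True: the (thin) SVD of A
print(np.allclose(U.T @ U, np.eye(len(s))))   # True: columns of U are orthonormal
```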

Back-propagation of truncated SVD: adjoint method


In this section, we will apply the adjoint method to the automatic differentiation of truncated SVD. The basic setting of the adjoint method can be found in previous articles. Specifically, let $A = A(\vec{\theta})$ be a general $M \times N$ real matrix that depends on some input parameters $\vec{\theta} = (\theta_1, \cdots, \theta_P)$. For simplicity, consider only a single singular value $s$ and the corresponding left and right singular vectors $\vec{l}$ and $\vec{r}$, respectively. The case of multiple singular values and singular vectors then follows easily. The output $(\vec{l}, \vec{r}, s)$ is effectively $(M+N+1)$-dimensional, and with the background knowledge developed in the last section, we choose the constraint functions as follows:

where the notation $O_i$ means the $i$th column of the matrix $O$. The condition $f_0(\vec{l}, \vec{r}, s, \vec{\theta}) = 0$ imposes a certain kind of “normalization” constraint. To see its origin, simply note that Eq. $\eqref{eq: SVD}$ implies $U^T A V = S$, since $U^T U = V^T V = I_k$. By focusing on the diagonal element of this equation corresponding to the desired singular value, one immediately obtains the desired condition, namely $\vec{l}^T A \vec{r} = s$.
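
This “normalization” constraint is easy to verify for the leading singular triplet returned by an off-the-shelf full SVD routine (a small sketch; the matrix and seed are arbitrary):

```python
import numpy as np

# Check the constraint l^T A r = s for the leading singular triplet.
np.random.seed(1)
A = np.random.randn(5, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
l, r = U[:, 0], Vt[0, :]
print(np.isclose(l @ A @ r, s[0]))   # True
```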

Making use of Eq. $\eqref{eq: fs for truncated SVD}$, one can readily derive the back-propagation formula of truncated SVD through the adjoint method. In the current case, the general formulation of the adjoint method presented in previous articles reduces to the following form:

The so-called adjoint equation $\left( \frac{\partial f}{\partial \vec{x}} \right)^T \vec{\eta} = \overline{\vec{x}}$ thus reads:

Similar to the case of the dominant eigensolver, one can first solve for $\eta_s$ from the first two equations of $\eqref{eq: adjoint equation}$ and obtain

The equality of the last two quantities is not so evident. In fact, it is related to the remaining gauge freedom of the setting. More specifically, the conditions $\eqref{eq: fs for truncated SVD}$ do not uniquely determine the singular vectors $\vec{l}$ and $\vec{r}$; in fact, they are invariant under the gauge transformation $\vec{l} \rightarrow c \vec{l}, \vec{r} \rightarrow \frac{\vec{r}}{c}$, where $c$ is an arbitrary nonzero scaling factor. However, simple inspection shows that the gradient computed through back-propagation is actually gauge independent. Thus, one can freely set the value of the scaling factor $c$. The most convenient choice, of course, is such that the normalization $\vec{l}^T \vec{l} = 1$ holds. The following conditions (for the chosen gauge) then follow immediately:

Owing to the gauge independence, Eq. $\eqref{eq: additional gauge condition}$ can be used in the remaining derivations of the adjoint method without loss of generality. It is also consistent with Eqs. $\eqref{eq: normalization}$-$\eqref{eq: from r to l}$ used in the original derivation of the SVD.
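
As a small illustration of this gauge freedom (a sketch, not part of the derivation), one can check numerically that the rescaling $\vec{l} \rightarrow c \vec{l}$, $\vec{r} \rightarrow \vec{r}/c$ leaves both the constraint $\vec{l}^T A \vec{r} = s$ and the rank-one combination $\vec{l} \vec{r}^T$ unchanged:

```python
import numpy as np

# The gauge transformation l -> c l, r -> r / c leaves the constraint
# l^T A r = s and the rank-one product l r^T invariant.
np.random.seed(2)
A = np.random.randn(4, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
l, r = U[:, 0], Vt[0, :]

c = 2.7                                   # an arbitrary nonzero scaling factor
l_scaled, r_scaled = c * l, r / c
print(np.isclose(l_scaled @ A @ r_scaled, s[0]))                    # True
print(np.allclose(np.outer(l_scaled, r_scaled), np.outer(l, r)))    # True
```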

By making use of $\eqref{eq: eta s}$ and $\eqref{eq: additional gauge condition}$, the first two equations of $\eqref{eq: adjoint equation}$ can be rewritten as

The general solutions for the vectors $\vec{\eta}_\vec{l}$ and $\vec{\eta}_\vec{r}$ then have the following form, respectively:

Owing to the third equation of $\eqref{eq: adjoint equation}$, $c_\vec{l}$ and $c_\vec{r}$ satisfy the additional condition

As the final step, the desired adjoint of a certain parameter $\theta_\mu$ reads

One can strip the parameters $\vec{\theta}$ out of the primitive by taking into account the fact that $\overline{\theta_\mu} = \textrm{Tr}\left(\overline{A}^T \frac{\partial A}{\partial \theta_\mu}\right)$. We thus finally obtain the adjoint relation $\overline{A} = \overline{A}(\overline{\vec{l}}, \overline{\vec{r}}, \overline{s})$ as follows:

where we have used $\eqref{eq: additional gauge condition}$ and $\eqref{eq: cl and cr}$ in the last two steps. Eqs. $\eqref{eq: xi l}$, $\eqref{eq: xi r}$ and $\eqref{eq: adjoint relation}$ together characterize the automatic differentiation of truncated SVD in a full-spectrum-free form.
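
Although the full adjoint relation is not reproduced here, one piece of it is easy to check independently: when only the singular value $s$ is used downstream, the adjoint of $A$ reduces to $\overline{s}\, \vec{l} \vec{r}^T$, i.e., $\partial s / \partial A = \vec{l} \vec{r}^T$, a standard perturbation-theory result that follows from the constraint $\vec{l}^T A \vec{r} = s$. The following sketch verifies this against finite differences:

```python
import numpy as np

# Partial check of the back-propagation of truncated SVD: for a loss that
# depends only on the leading singular value s, the adjoint of A is
# sbar * l r^T, i.e. d s / d A_ij = l_i r_j.  Compare with finite differences.
np.random.seed(3)
M, N = 5, 3
A = np.random.randn(M, N)

def leading_singular_value(A):
    return np.linalg.svd(A, compute_uv=False)[0]

U, s, Vt = np.linalg.svd(A, full_matrices=False)
analytic = np.outer(U[:, 0], Vt[0, :])       # l r^T

eps = 1e-6
numeric = np.zeros_like(A)
for i in range(M):
    for j in range(N):
        dA = np.zeros_like(A)
        dA[i, j] = eps
        numeric[i, j] = (leading_singular_value(A + dA)
                         - leading_singular_value(A - dA)) / (2 * eps)

print(np.allclose(numeric, analytic, atol=1e-5))   # True
```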

Relation with the approach based on full SVD


In this section, we will derive the back-propagation of truncated SVD by wrapping within it the corresponding process of full SVD. The relevant notations are the same as those in the first section. The adjoint relation of the full SVD process, $\overline{A} = \overline{A}(\overline{U}, \overline{V}, \overline{S})$, is standard and can be easily obtained; for details, see the references listed in the last section. The final result is:

where $F$ is a $k \times k$ anti-symmetric matrix with off-diagonal elements $F_{ij} = (s_j^2 - s_i^2)^{-1}$ and $\circ$ denotes the Hadamard element-wise product.
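
For concreteness, here is one way to assemble the $F$ matrix just described (a sketch; in a real implementation one would also have to regularize the case of (nearly) degenerate singular values):

```python
import numpy as np

# F_ij = 1 / (s_j^2 - s_i^2) for i != j, and F_ii = 0.
def build_F(s):
    s2 = s ** 2
    diff = s2[None, :] - s2[:, None]          # diff[i, j] = s_j^2 - s_i^2
    F = np.zeros((len(s), len(s)))
    mask = ~np.eye(len(s), dtype=bool)        # off-diagonal entries only
    F[mask] = 1.0 / diff[mask]
    return F

s = np.array([3.0, 2.0, 1.0])
F = build_F(s)
print(np.allclose(F, -F.T))                   # True: F is anti-symmetric
```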

Without loss of generality, let the desired singular value $s$ and singular vectors $\vec{l}$, $\vec{r}$ be the “first” one of the full spectrum. In other words, we have

The procedure of wrapping the process of full SVD within the truncated one then implies that the adjoints of $U$, $V$ and $S$ should take the following form:

Substituting $\eqref{eq: U V S l r s}$ and $\eqref{eq: adjoints U V S l r s}$ into Eq. $\eqref{eq: adjoint relation full SVD}$ yields the following results for each term:

To clarify the relation between Eq. $\eqref{eq: adjoint relation explicitly dependent form}$ and the earlier result $\eqref{eq: adjoint relation}$, we should expand the vectors $\vec{\xi}_\vec{l}$ and $\vec{\xi}_\vec{r}$, characterized by Eqs. $\eqref{eq: xi l}$ and $\eqref{eq: xi r}$, in the complete bases $(\vec{l}, \vec{l}_2, \cdots, \vec{l}_M)$ and $(\vec{r}, \vec{r}_2, \cdots, \vec{r}_N)$, respectively. Making use of the original relations $\eqref{eq: leigenvector}$-$\eqref{eq: normalization}$, it is easy to obtain the expansion coefficients for $\vec{\xi}_\vec{l}$ as follows:

The same result also holds for $\vec{r}_i^T \vec{\xi}_\vec{r}$. The expansion expressions for $\vec{\xi}_\vec{l}$ and $\vec{\xi}_\vec{r}$ are thus as follows:

where we have used the completeness relations $U U^T + \sum_{i=k+1}^M \vec{l}_i \vec{l}_i^T = I_M$ and $V V^T + \sum_{i=k+1}^N \vec{r}_i \vec{r}_i^T = I_N$ to eliminate the eigenvectors with eigenvalue zero. Furthermore, by making use of Eqs. $\eqref{eq: from l to r}$ and $\eqref{eq: from r to l}$, we have

Motivated by the expansion expressions $\eqref{eq: xi l expansion}$-$\eqref{eq: A xi r expansion}$, one can readily recognize the relevant pieces and convert the original results $\eqref{eq: adjoint relation explicitly dependent form}$ into a more compact form:

Eq. $\eqref{eq: adjoint relation again}$ is identical to the result $\eqref{eq: adjoint relation}$ obtained through the adjoint method.

Note that the original adjoint relation $\eqref{eq: adjoint relation full SVD}$ of full SVD is pretty lengthy. However, when it is used to derive the back-propagation of truncated SVD, this complication is considerably reduced by the vectors $\vec{\xi}_\vec{l}$ and $\vec{\xi}_\vec{r}$: these two vectors take care of the components within the eigen-subspaces of both nonzero and zero eigenvalues at the same time.
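
The completeness relations used above are also easy to check numerically. The following sketch (for a deliberately rank-deficient matrix; sizes and seed are arbitrary) verifies that the left singular vectors with nonzero singular values span the column space of $A$, and that padding them with the zero-eigenvalue eigenvectors of $A A^T$ restores the identity:

```python
import numpy as np

# Completeness check: U_k U_k^T projects onto the column space of A, so
# (I_M - U_k U_k^T) A = 0, and U_k U_k^T + U_0 U_0^T = I_M, where U_0
# collects the zero-eigenvalue eigenvectors of A A^T.
np.random.seed(4)
M, N, k = 6, 4, 2
A = np.random.randn(M, k) @ np.random.randn(k, N)     # a rank-k matrix

U, s, Vt = np.linalg.svd(A, full_matrices=True)
U_k = U[:, :k]                                        # nonzero-singular-value part
U_0 = U[:, k:]                                        # zero-eigenvalue part of A A^T

print(np.allclose((np.eye(M) - U_k @ U_k.T) @ A, 0))          # True
print(np.allclose(U_k @ U_k.T + U_0 @ U_0.T, np.eye(M)))      # True
```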

References


There are many texts and lecture notes covering the basics of SVD. For example,

The derivation of the back-propagation (i.e., adjoint relation) of full SVD can be found in, say