
I am trying to compute the linear approximation to, say, $f(A)=A^{-1}$ using a result such as the one from http://www.matrixcalculus.org/, which returns $-A^{-1}\otimes A^{-1}$; this is not a linear function but a matrix (a Kronecker product). I guess they mean that the linear approximation is the function that takes a $d\times d$ matrix $X$ and returns $-A^{-1} X A^{-1}$. What are the hidden assumptions in this interpretation?
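For concreteness, here is how I would check that reading numerically (a minimal sketch assuming NumPy and the column-major vec convention, under which the standard identity $\operatorname{vec}(BXC)=(C^\top\otimes B)\operatorname{vec}(X)$ gives the factor $-A^{-\top}\otimes A^{-1}$; this coincides with the reported $-A^{-1}\otimes A^{-1}$ when $A$ is symmetric):

```python
import numpy as np

# Check that the Kronecker-product matrix acts on vec(X) the same way as
# the candidate linear map X |-> -A^{-1} X A^{-1}.
rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d)) + d * np.eye(d)   # comfortably invertible
X = rng.standard_normal((d, d))                   # arbitrary direction

A_inv = np.linalg.inv(A)
vec = lambda M: M.flatten(order="F")              # stack columns (column-major vec)

lhs = -np.kron(A_inv.T, A_inv) @ vec(X)           # Kronecker form applied to vec(X)
rhs = vec(-A_inv @ X @ A_inv)                     # matrix form of the linear map
print(np.allclose(lhs, rhs))                      # True
```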

Similarly, here, as in all the other places I have read, it is assumed that the linear approximation uses the Frobenius/standard inner product (trace). Is this the only possible interpretation? How can I derive it myself?

I know that all norms on the finite-dimensional spaces $V$ and $W$ are equivalent (so we can assume the Frobenius norm w.l.o.g.), and that the derivative (a linear function) $D_f:(V,\|\cdot\|_V)\to (W,\|\cdot\|_W)$ is unique, but only once specific vector spaces $V$ and $W$ are given. In the questions above we may define the (undefined) domain of $f$ to be any vector space that contains the matrix $A$ (symmetric matrices, kernels, ...), and we also have many options for the image vector space $W$.

More trivially, in textbooks it is written that the (Frechet) derivative of $2x$ is $2$, but such a derivative must be a linear function. They of course mean that the derivative of $f:V\to W$ that maps $x$ to $2x$ is the linear function $f':V\to W$ that maps every $x$ to $2$, where $V=W=\mathbb{R}$. Can't we define a different subset $V$ of $\mathbb{R}$ with a different product operator (still a norm) that would yield another answer? Say, binary numbers and operators?

The question is also related to the big confusions and mistakes over the years in this forum and famous books, as explained in: https://arxiv.org/abs/1911.06491

I hope it is clear enough, and many thanks in advance.

1 Answer


What are the hidden assumptions in this interpretation?

There are no hidden assumptions. Here is how to do this calculation: if $A$ is an invertible matrix, then for $\varepsilon$ a sufficiently small matrix (with respect to any matrix norm), the perturbation $A + \varepsilon$ remains invertible, and

$$\begin{align*} (A + \varepsilon)^{-1} &= ((I + \varepsilon A^{-1}) A)^{-1} \\ &= A^{-1} (I + \varepsilon A^{-1})^{-1} \\ &= A^{-1} (I - \varepsilon A^{-1} + O(\varepsilon^2)) \\ &= A^{-1} - A^{-1} \varepsilon A^{-1} + O(\varepsilon^2) \end{align*} $$

so we get, as desired, that the Frechet derivative is the linear map $\varepsilon \mapsto - A^{-1} \varepsilon A^{-1}$. This argument is valid with respect to any matrix norm. (Strictly speaking we need a small argument involving the convergence of a geometric series to show that the $O(\varepsilon^2)$ term is justified, but this is a standard von Neumann series argument.)
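As a quick sanity check (a minimal numeric sketch, assuming NumPy; the particular matrices and step size here are only for illustration), one can compare a small finite perturbation of $A^{-1}$ against the linear map $\varepsilon \mapsto -A^{-1}\varepsilon A^{-1}$:

```python
import numpy as np

# Finite-difference check of the Frechet derivative of A |-> A^{-1}
# at a well-conditioned A, in a random direction E.
rng = np.random.default_rng(1)
d = 5
A = rng.standard_normal((d, d)) + d * np.eye(d)
E = rng.standard_normal((d, d))
A_inv = np.linalg.inv(A)

t = 1e-6
finite_diff = (np.linalg.inv(A + t * E) - A_inv) / t   # (f(A + tE) - f(A)) / t
frechet = -A_inv @ E @ A_inv                           # predicted directional derivative
print(np.linalg.norm(finite_diff - frechet))           # small, on the order of t
```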

More generally, the derivative of a differentiable map between finite-dimensional vector spaces (including spaces of matrices) can be computed as a linear approximation with respect to any choice of norms and does not depend on that choice.
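To spell out why (a standard norm-equivalence argument): if $\|\cdot\|_V, \|\cdot\|'_V$ are two norms on $V$ and $\|\cdot\|_W, \|\cdot\|'_W$ are two norms on $W$, then by equivalence of norms on finite-dimensional spaces there are constants $c_V, C_W > 0$ with $\|\varepsilon\|'_V \ge c_V \|\varepsilon\|_V$ and $\|r\|'_W \le C_W \|r\|_W$, so for any candidate linear map $L$

$$\frac{\|f(A+\varepsilon)-f(A)-L\varepsilon\|'_W}{\|\varepsilon\|'_V} \;\le\; \frac{C_W}{c_V}\cdot\frac{\|f(A+\varepsilon)-f(A)-L\varepsilon\|_W}{\|\varepsilon\|_V},$$

and symmetrically with the primed and unprimed norms swapped. Hence the remainder is $o(\|\varepsilon\|)$ for one pair of norms if and only if it is for the other, and the same $L$ serves as the derivative in both cases.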

it is assumed that the linear approximation uses the Frobenius/standard inner product (trace). Is this the only possible interpretation? How can I derive it myself?

I don't see where the Frobenius inner product is used in the linked post to define what a linear approximation is. It is used to write down linear maps; e.g. if one wants to differentiate a scalar-valued matrix function $f : M_n \to \mathbb{R}$ the result is a linear functional on $M_n$ and any such linear functional can be written as $\varepsilon \mapsto \text{tr}(A \varepsilon)$ for a unique matrix $A$, so it's convenient to describe such derivatives by identifying them with the corresponding matrices $A$. I'm under the impression this is a standard convention in many places that is used without comment. (There's also a further question of whether we should take the transpose of that matrix or not but this gets into annoying issues like the difference between a derivative and a gradient.)
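As a concrete instance of that convention (a small sketch, assuming NumPy; the example function is my own, not from the linked post): for $f(X) = \operatorname{tr}(X^2)$ the derivative at $X$ is the functional $\varepsilon \mapsto \operatorname{tr}(2X\varepsilon)$, so the matrix one records is $2X$ (up to the transpose issue mentioned above):

```python
import numpy as np

# The derivative of f(X) = tr(X^2) at X is the functional e |-> tr(2 X e);
# compare it against a finite difference in a random direction E.
rng = np.random.default_rng(2)
d = 4
X = rng.standard_normal((d, d))
E = rng.standard_normal((d, d))

f = lambda M: np.trace(M @ M)
t = 1e-6
finite_diff = (f(X + t * E) - f(X)) / t
functional = np.trace(2 * X @ E)        # tr(A e) with representing matrix A = 2 X
print(abs(finite_diff - functional))    # small
```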

In the questions above we may define the (undefined) domain of $f$ to be any vector space that contains the matrix $A$ (symmetric matrices, kernels, ...), and we also have many options for the image vector space $W$.

None of those choices affect the Frechet derivative if it's calculated correctly. The paper you link is disturbing but it also clearly explains how carefully applying the definition of the Frechet derivative solves everything. The confusion is, among other things, about gradients, which involve a choice of inner product and which depend on that choice, and also about how these choices interact with passing to subspaces such as symmetric matrices.
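To make that dependence concrete (an example of my own, not from the paper): for $f(x) = c^\top x$ on $\mathbb{R}^n$, the Frechet derivative at every point is the functional $\varepsilon \mapsto c^\top \varepsilon$, independent of any choices; but the gradient with respect to the inner product $\langle x, y\rangle_M = x^\top M y$ (for a symmetric positive definite $M$) is the vector $g$ satisfying $\langle g, \varepsilon\rangle_M = c^\top\varepsilon$ for all $\varepsilon$, i.e. $g = M^{-1}c$, which changes with $M$.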

More trivially, in textbooks it is written that the (Frechet) derivative of $2x$ is $2$, but such a derivative must be a linear function. They of course mean that the derivative of $f:V\to W$ that maps $x$ to $2x$ is the linear function $f':V\to W$ that maps every $x$ to $2$, where $V=W=\mathbb{R}$. Can't we define a different subset $V$ of $\mathbb{R}$ with a different product operator (still a norm) that would yield another answer? Say, binary numbers and operators?

The function which maps $x$ to $2$ is not linear. Every linear function is its own Frechet derivative; "$2$" is shorthand for the linear map $x \mapsto 2x$. This calculation is valid for the map $x \mapsto 2x$ on any normed vector space. I don't know what you mean by "different product operator."
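To spell that out: if $T$ is linear, then $T(x+\varepsilon) - T(x) - T(\varepsilon) = 0$ for every $\varepsilon$, so the remainder in the definition of the Frechet derivative vanishes identically and $DT(x) = T$ at every point $x$. For $T(x) = 2x$ on $\mathbb{R}$ that derivative is the map $\varepsilon \mapsto 2\varepsilon$, which is what the shorthand "$2$" refers to.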

Qiaochu Yuan
  • Thank you for the detailed answer. "about gradients, which involve a choice of inner product and which depend on that choice, and also about how these choices interact with passing to subspaces such as symmetric matrices." – Dan Feldman Sep 24 '22 at 16:31
  • Unfortunately, editing is allowed only for 5 minutes, so here it is again: Thank you for the detailed answer. Small comments: (i) by "Frobenius inner product" I meant the trace of $A^TB$, as defined in the paper, and (ii) $O(\varepsilon^2)$ holds only in this example, but more generally we should use $o(\varepsilon)$, right? Can you please give more details about "with respect to any choice of norms and does not depend on that choice" in the first part, compared to "about gradients, which involve a choice of inner product and which depend on that choice" in the second part? – Dan Feldman Sep 24 '22 at 16:45
  • i) Yes, $o(\varepsilon)$ in general. ii) On a finite-dimensional vector space every pair of norms is equivalent. You can use this to show that the Frechet derivative of a map between finite-dimensional vector spaces does not depend on the choice of norm used in the definition. The gradient, on the other hand, starts from a derivative and then applies a map coming from an inner product, and it depends on this choice. – Qiaochu Yuan Sep 25 '22 at 03:37