Asking about linear transformations in the context of Gram-Schmidt is (generally) the wrong question. That is because Gram-Schmidt is about signal representation, not about linear transformations. It applies to general inner product spaces $V$; for simplicity, assume $V=\mathbb{R}^k$ with the standard inner product.
We have:

- Positive integers $k$ and $n$.
- A collection of $n$ vectors in $\mathbb{R}^k$ (not all of which are zero): $\{u_1, …, u_n\}$.
Define $U$ as the span of these vectors, that is, the set of all linear combinations of them:
\begin{align}
U &= Span(\{u_1, ..., u_n\}) \\
&= \left\{u \in \mathbb{R}^k : u = \sum_{i=1}^n x_i u_i \quad \mbox{ for some real numbers $x_1,...,x_n$}\right\}
\end{align}
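As a concrete sanity check of this definition, here is a small numpy sketch (the helper name `in_span` and the tolerance are my own choices, not standard): it tests whether a vector lies in $U$ by solving the least-squares problem $\min_x ||Ax - u||$ and checking that the residual is numerically zero.

```python
import numpy as np

def in_span(vectors, u, tol=1e-10):
    """Test whether u is a linear combination of the given vectors.

    Stacks the vectors as the columns of a k x n matrix A and checks
    whether min_x ||Ax - u|| is numerically zero.
    """
    A = np.column_stack(vectors)
    x, *_ = np.linalg.lstsq(A, u, rcond=None)
    return np.linalg.norm(A @ x - u) < tol

u1 = np.array([1.0, 0.0, 1.0])
u2 = np.array([0.0, 1.0, 1.0])
print(in_span([u1, u2], u1 + 3 * u2))                 # True:  in U
print(in_span([u1, u2], np.array([0.0, 0.0, 1.0])))   # False: not in U
```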
Notice that the span $U$ does not depend on the order in which we list the vectors $\{u_1, ..., u_n\}$: we can permute them however we like without changing $U$.
If we create a real-valued $k\times n$ matrix $A$ with columns equal to the $\{u_1, ..., u_n\}$ vectors, so that the first column of $A$ is the vector $u_1$, the second column of $A$ is the vector $u_2$, and so on, then $U$ is equal to the column space of $A$:
$$ U = Span(\{u_1, ..., u_n\}) = \{ u : u=Ax \mbox{ for some $x \in \mathbb{R}^n$}\}$$
Permuting the columns of $A$ does not change its column space. Now, every matrix $A$ happens to define a linear transformation, but we do not care about that. Permuting the columns of a matrix creates different linear transformations (and we do not care about that either). We only care about the subspace $U$, and about efficiently representing vectors in this subspace.
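A quick numerical illustration of that invariance (the matrices here are made up): two matrices have the same column space exactly when stacking them side by side adds no new directions, i.e. does not increase the rank.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))    # k = 5, n = 3
A_perm = A[:, [2, 0, 1]]           # same columns, permuted order

# Col(A) == Col(A_perm) iff stacking them does not increase the rank:
r = np.linalg.matrix_rank
print(r(np.hstack([A, A_perm])) == r(A) == r(A_perm))  # True
```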
What we can do with this:
Any vector $u \in U$ can be represented by a (possibly non-unique)
$n$-tuple $(x_1, …, x_n) \in \mathbb{R}^n$. Given $(x_1, …, x_n)$, we can obtain $u$ by:
$$ u = \sum_{i=1}^n x_i u_i = Ax$$
The total “energy” in the vector $u$ can be obtained via:
$$ ||u||^2 = \sum_{i=1}^n \sum_{j=1}^n x_i x_j u_i^Tu_j = x^TA^TAx$$
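Both the reconstruction $u = Ax$ and this energy formula read off directly in code; a minimal sketch, with made-up vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))    # columns u_1, u_2, u_3 in R^4
x = np.array([2.0, -1.0, 0.5])     # a coefficient tuple (x_1, x_2, x_3)

u = A @ x                          # u = sum_i x_i u_i
energy = x @ A.T @ A @ x           # x^T A^T A x
print(np.isclose(energy, np.linalg.norm(u) ** 2))  # True
```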
Is the tuple $(x_1, ..., x_n)$ the most efficient way of representing a vector $u \in U$? (Generally no).
Enter Gram-Schmidt
We get a procedure for taking the vectors $\{u_1, …, u_n\}$ in $\mathbb{R}^k$ (not all of which are zero) and producing an orthonormal list of vectors $\{v_1, ..., v_m\}$ (where $m \leq n$) with the property that
$$U=Span(\{u_1, ..., u_n\}) = Span(\{v_1, ..., v_m\})$$
If we form a real-valued $k \times m$ matrix $B$ by stacking the vectors $\{v_1, ..., v_m\}$ as columns, then $U$ is the column space of $B$. Now matrix $B$ is not necessarily the same size as matrix $A$, but both $A$ and $B$ have the same column space (that is the only relationship between $A$ and $B$ that we care about).
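Here is a minimal sketch of classical Gram-Schmidt with the drop step that makes $m \leq n$ possible (the function name `gram_schmidt` and the tolerance are my choices): a (numerically) dependent column leaves a zero residual and is discarded, and the surviving orthonormal vectors are stacked as the columns of $B$.

```python
import numpy as np

def gram_schmidt(A, tol=1e-10):
    """Return B with orthonormal columns spanning the column space of A."""
    basis = []
    for a in A.T:                       # iterate over the columns of A
        v = a.copy()
        for b in basis:
            v -= (b @ a) * b            # subtract the projection onto b
        norm = np.linalg.norm(v)
        if norm > tol:                  # drop (numerically) dependent columns
            basis.append(v / norm)
    return np.column_stack(basis)       # B is k x m with m <= n

A = np.array([[1.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])         # third column = first + second
B = gram_schmidt(A)
print(B.shape)                                    # (3, 2): m = 2 < n = 3
print(np.allclose(B.T @ B, np.eye(B.shape[1])))   # True: orthonormal columns
```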
It follows that every vector $u \in U$ can now be uniquely represented by
a tuple $y=(y_1, ..., y_m) \in \mathbb{R}^m$:
$$ u = \sum_{i=1}^m v_i y_i = By$$
Further, since the columns of $B$ are orthonormal we have $B^TB=I$, so the energy is easy to compute:
$$ ||u||^2 = y^TB^TBy = y^Ty = ||y||^2 = \sum_{i=1}^m y_i^2$$
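Because the columns of $B$ are orthonormal, the unique coefficient tuple is simply $y = B^Tu$ for any $u \in U$. A small self-contained sketch (with a hand-picked orthonormal $B$) verifying both $u = By$ and the energy identity:

```python
import numpy as np

# A hand-picked orthonormal pair of columns in R^3:
B = np.column_stack([np.array([1.0, 1.0, 0.0]) / np.sqrt(2),
                     np.array([0.0, 0.0, 1.0])])
u = B @ np.array([3.0, -4.0])           # some u in U = Col(B)

y = B.T @ u                             # unique coefficients: y = B^T u
print(np.allclose(B @ y, u))                        # True: u = By
print(np.isclose(y @ y, np.linalg.norm(u) ** 2))    # True: ||u||^2 = ||y||^2
```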
Thus, the only reason we care about Gram-Schmidt is that it gives us a nicer representation of vectors in the subspace $U$. At first, that seems to diminish the value of Gram-Schmidt. Not at all: The reason you hear about Gram-Schmidt is that it is very important to give nice representations of things.
Now if we happen to start with a linear transformation $T:V\rightarrow W$ for some vector spaces $V, W$, then it may make sense to try to represent this transformation efficiently: we might find an orthonormal basis for $V$ and another for $W$, represent $T$ by how it maps each basis vector of $V$ to a linear combination of basis vectors of $W$, and define the matrix of $T$ with respect to these bases, and so on.
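For instance, if $V=\mathbb{R}^k$ and $W=\mathbb{R}^j$, $T$ is given by a matrix $M$ in standard coordinates, and orthonormal bases are stacked as the columns of matrices $B$ and $C$, then the matrix of $T$ with respect to those bases is $C^TMB$. A sketch under those assumptions (the bases here are generated randomly via QR, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((2, 3))    # T: R^3 -> R^2 in standard coordinates

# Orthonormal bases for domain and codomain, as the columns of B and C:
B, _ = np.linalg.qr(rng.standard_normal((3, 3)))
C, _ = np.linalg.qr(rng.standard_normal((2, 2)))

T_rep = C.T @ M @ B    # matrix of T with respect to these bases

# Sanity check: map coordinates in basis B to coordinates in basis C,
# then back to standard coordinates; this must agree with applying M.
v = rng.standard_normal(3)
print(np.allclose(C @ (T_rep @ (B.T @ v)), M @ v))  # True
```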