One way to motivate dual spaces and transposes is to consider differentiation of scalar-valued functions of several variables. The basic point is that, short of constant functions, functionals are the easiest functions to deal with, so that differentiation essentially amounts to approximating a function near a point by a unique functional, with an error that is suitably well behaved. Moreover, transposes arise naturally when differentiating, say, the composition of a scalar-valued function with a change of coordinates.
Let $f : (a,b) \to \mathbb{R}$. Conventionally, one defines $f$ to be differentiable at $x \in (a,b)$ if the limit
$$
\lim_{h \to 0} \frac{f(x+h)-f(x)}{h}
$$
exists, in which case the value of that limit is defined to be the derivative $f^\prime(x)$ of $f$ at $x$. Observe, however, that this definition means that for $h$ small enough,
$$
f(x+h)-f(x) = f^\prime(x)h + R_x(h),
$$
where $h \mapsto f^\prime(x)h$ defines a linear transformation $df_x : \mathbb{R} \to \mathbb{R}$ approximating $f$ near $x$, and where the error term $R_x(h)$ satisfies
$$
\lim_{h \to 0} \frac{R_x(h)}{h} = 0.
$$
In fact, $f$ is differentiable at $x$ if and only if there exists a linear transformation $T : \mathbb{R} \to \mathbb{R}$ such that
$$
\lim_{h \to 0} \frac{\lvert f(x+h) - f(x) - T(h) \rvert}{\lvert h \rvert} = 0,
$$
in which case $df_x := T$ is unique, and given by multiplication by the scalar $f^\prime(x) = T(1)$.
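Concretely, we can watch the remainder condition happen numerically. The following is a minimal sketch assuming NumPy; the function $f = \sin$ and the point $x = 0.7$ are arbitrary illustrative choices.
```python
import numpy as np

f = np.sin          # an arbitrary differentiable function
fprime = np.cos     # its known derivative, so that df_x(h) = fprime(x) * h

x = 0.7
for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    R = f(x + h) - f(x) - fprime(x) * h        # the remainder R_x(h)
    print(f"h = {h:.0e},  R_x(h)/h = {R / h:.3e}")
# The printed ratios shrink roughly in proportion to h, consistent with
# R_x(h)/h -> 0 as h -> 0.
```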
Now, let $f : U \to \mathbb{R}^m$, where $U$ is an open subset of $\mathbb{R}^n$. Then this characterisation still makes perfect sense: we define $f$ to be differentiable at $x \in U$ if and only if there exists a linear transformation $T : \mathbb{R}^n \to \mathbb{R}^m$ such that
$$
\lim_{h \to 0} \frac{\| f(x+h) - f(x) - T(h) \|}{\|h\|} = 0,
$$
in which case $df_x := T$ is unique; in particular, for $\|h\|$ small enough,
$$
f(x+h) - f(x) = df_x(h) + R_x(h),
$$
where $df_x$ gives a linear approximation of $f$ near $x$, such that the error term $R_x(h)$ satisfies
$$
\lim_{h \to 0} \frac{R_x(h)}{\|h\|} = 0.
$$
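The same sanity check works in several variables, with $df_x$ represented by the Jacobian matrix of $f$ at $x$. Again a sketch assuming NumPy; the map $f : \mathbb{R}^2 \to \mathbb{R}^2$, the point $x$, and the direction of $h$ are arbitrary illustrative choices.
```python
import numpy as np

def f(p):                                  # an arbitrary map f : R^2 -> R^2
    x1, x2 = p
    return np.array([x1 * x2, np.exp(x1) + x2 ** 2])

def df(p):                                 # df_p, represented by the Jacobian matrix at p
    x1, x2 = p
    return np.array([[x2,         x1    ],
                     [np.exp(x1), 2 * x2]])

x = np.array([0.3, -1.2])
direction = np.array([1.0, 2.0]) / np.sqrt(5.0)   # a fixed unit vector; h = t * direction

for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    h = t * direction
    R = f(x + h) - f(x) - df(x) @ h               # the remainder R_x(h)
    print(f"||h|| = {t:.0e},  ||R_x(h)||/||h|| = {np.linalg.norm(R) / t:.3e}")
# The ratio decays roughly linearly in ||h||, consistent with the defining limit.
```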
At last, let's specialise to the case where $f : U \to \mathbb{R}$, i.e., where $m=1$. If $f$ is differentiable at $x$, then $df_x : \mathbb{R}^n \to \mathbb{R}$ is linear, and hence $df_x \in (\mathbb{R}^n)^\ast$ by definition. In particular, for any $v \in \mathbb{R}^n$, the directional derivative
$$
\nabla_v f(x) := \lim_{\epsilon \to 0} \frac{f(x+\epsilon v) - f(x)}{\epsilon}
$$
exists and is given by
$$
\nabla_v f(x) = df_x(v).
$$
Moreover, the gradient of $f$ at $x$ is exactly the unique vector $\nabla f(x) \in \mathbb{R}^n$ such that
$$
\forall v \in \mathbb{R}^n, \quad df_x(v) = \langle \nabla f(x), v \rangle.
$$
In any event, the derivative of a scalar-valued function of $n$ variables at a point is most naturally understood as a functional on $\mathbb{R}^n$, i.e., as an element of $(\mathbb{R}^n)^\ast$.
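This identity is easy to test numerically: a finite-difference approximation of the directional derivative $\nabla_v f(x)$ should match the pairing $\langle \nabla f(x), v \rangle = df_x(v)$. The sketch below assumes NumPy; the particular $f$, $x$, and $v$ are arbitrary illustrative choices.
```python
import numpy as np

def f(p):                                  # an arbitrary scalar-valued f : R^2 -> R
    return p[0] ** 2 * p[1] + np.sin(p[1])

def grad_f(p):                             # its gradient, computed by hand
    return np.array([2 * p[0] * p[1], p[0] ** 2 + np.cos(p[1])])

x = np.array([1.5, 0.4])
v = np.array([-0.7, 2.0])                  # an arbitrary direction (not necessarily unit)

eps = 1e-6
directional = (f(x + eps * v) - f(x)) / eps    # finite-difference approximation of grad_v f(x)
pairing = grad_f(x) @ v                        # <grad f(x), v> = df_x(v)
print(directional, pairing)                    # the two values agree to several digits
```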
Now, suppose, for simplicity, that $f : \mathbb{R}^n \to \mathbb{R}$ is everywhere differentiable, and let $S : \mathbb{R}^p \to \mathbb{R}^n$ be a linear transformation, e.g., a change of coordinates $\mathbb{R}^n \to \mathbb{R}^n$. Then $f \circ S$ is everywhere differentiable, with derivative
$$
d_y(f \circ S) = (d_{Sy} f) \circ S = S^t d_{Sy} f
$$
at each $y \in \mathbb{R}^p$. On the one hand, if $S = 0$, then $f \circ S \equiv f(0)$ is constant, so that $d_y(f \circ S) = 0 = S^t d_{Sy} f$, as required. On the other hand, if $S \neq 0$, so that
$$
\|S\| := \sup_{k \neq 0} \frac{\|Sk\|}{\|k\|} > 0,
$$
it follows that, whenever $Sk \neq 0$,
$$
\frac{\lvert (f \circ S)(y+k) - (f \circ S)(y) - (d_{Sy} f \circ S)(k) \rvert}{\|k\|} = \frac{\lvert f(Sy + Sk) - f(Sy) - d_{Sy}f(Sk) \rvert}{\|k\|} \leq \|S\| \, \frac{\lvert f(Sy + Sk) - f(Sy) - d_{Sy}f(Sk) \rvert}{\|Sk\|},
$$
which tends to $0$ as $k \to 0$ by differentiability of $f$ at $Sy$, since $\|Sk\| \leq \|S\|\,\|k\| \to 0$ as $k \to 0$; whenever $Sk = 0$, the left-hand side vanishes outright. Either way, the defining limit holds, so $f \circ S$ is differentiable at $y$ with $d_y(f \circ S) = (d_{Sy} f) \circ S = S^t d_{Sy} f$, as claimed.
More concretely, once we know that $f \circ S$ is everywhere differentiable, we can compute its derivative directly: for each $k \in \mathbb{R}^p$, by linearity of $S$,
$$
(f \circ S)(y + \epsilon k) = f(Sy + \epsilon Sk),
$$
so that, indeed
$$
\left(d_y(f \circ S)\right)(k) = \nabla_k(f \circ S)(y) = \nabla_{Sk}f(Sy) = (d_{Sy}f)(Sk) = (S^t d_{Sy}f)(k).
$$
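In coordinates, since $\langle \nabla (f \circ S)(y), k \rangle = (d_{Sy}f)(Sk) = \langle \nabla f(Sy), Sk \rangle = \langle S^t \nabla f(Sy), k \rangle$ for all $k$, the identity above says that $\nabla (f \circ S)(y) = S^t \nabla f(Sy)$, with $S^t$ now acting as the transposed matrix. Here is a numerical sketch of that, assuming NumPy; the linear map $S : \mathbb{R}^3 \to \mathbb{R}^2$, the function $f$, and the point $y$ are arbitrary illustrative choices.
```python
import numpy as np

def f(p):                                  # an arbitrary f : R^2 -> R
    return p[0] ** 2 + 3.0 * p[0] * p[1]

def grad_f(p):                             # its gradient, computed by hand
    return np.array([2 * p[0] + 3.0 * p[1], 3.0 * p[0]])

S = np.array([[1.0,  2.0, 0.5],
              [0.0, -1.0, 4.0]])           # a linear map S : R^3 -> R^2
y = np.array([0.2, 1.0, -0.3])

predicted = S.T @ grad_f(S @ y)            # S^t applied to grad f(Sy), a vector in R^3

# Compare with a finite-difference gradient of f o S at y.
eps = 1e-6
f_S = lambda q: f(S @ q)
numerical = np.array([(f_S(y + eps * e) - f_S(y)) / eps for e in np.eye(3)])
print(predicted)
print(numerical)                           # the two gradients agree up to the finite-difference error
```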
In general, if $S : \mathbb{R}^p \to \mathbb{R}^n$ is any everywhere-differentiable map (everywhere, again, just for simplicity), not necessarily linear, then
$$
d_y (f \circ S) = (d_{S(y)}f) \circ d_y S = (d_y S)^t d_{S(y)}f,
$$
which is none other than the relevant case of the chain rule.
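In coordinates, with $d_y S$ represented by the Jacobian matrix of $S$ at $y$, this reads $\nabla (f \circ S)(y) = (d_y S)^t \nabla f(S(y))$. One last numerical sketch, again assuming NumPy; the nonlinear map $S : \mathbb{R}^2 \to \mathbb{R}^2$, the function $f$, and the point $y$ are arbitrary illustrative choices.
```python
import numpy as np

def f(p):                                  # an arbitrary f : R^2 -> R
    return np.sin(p[0]) * p[1]

def grad_f(p):                             # its gradient, computed by hand
    return np.array([np.cos(p[0]) * p[1], np.sin(p[0])])

def S(q):                                  # an arbitrary nonlinear map S : R^2 -> R^2
    return np.array([q[0] * q[1], q[0] + q[1] ** 2])

def jac_S(q):                              # d_y S, represented by the Jacobian matrix at q
    return np.array([[q[1], q[0]    ],
                     [1.0,  2 * q[1]]])

y = np.array([0.8, -0.5])

predicted = jac_S(y).T @ grad_f(S(y))      # (d_y S)^t applied to grad f(S(y))

# Compare with a finite-difference gradient of f o S at y.
eps = 1e-6
f_S = lambda q: f(S(q))
numerical = np.array([(f_S(y + eps * e) - f_S(y)) / eps for e in np.eye(2)])
print(predicted)
print(numerical)                           # the two gradients agree up to the finite-difference error
```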