I was thinking about the same question recently. I think now I get a rather satisfactory answer, I will just post it here.
After some search I think I can now see things a bit clearer. I will try to clarify my thoughts and do a summary.
Initially I was surprised by the fact that gradient in non-Cartesian coordinates has a different formula. For example, gradient in 2D polar coordinates is:
$$
\nabla f = \frac{\partial f}{\partial r} \mathbf{\hat{e}}_r + \frac{1}{r}\frac{\partial f}{\partial \theta} \mathbf{\hat{e}}_\theta
$$
Searching on that problem led me to curvilinear coordinates, where $d\mathbf{r}$ is treated and manipulated like a differential form. However, that's not what I thought what differential form is, as differential form is defined to be a real valued function. I suspected there's a corresponding 'vector-valued' differential form definition, that was why I asked this question.
However thinking about this a bit more, this shouldn't be the case. $d\mathbf{r}$ is used in line integral, which is written as
$$
\oint_\gamma f \cdot d\mathbf{r}
$$
If we use differential form to interpret the integral, the integral is equivalent to
$$
\oint_\gamma f_x dx + f_y dy
$$
Here apparently $f_x dx + f_y dy$ is the differential form. So if there is a differential form, it should be $f \cdot d\mathbf{r}$ not $d\mathbf{r}$.
So back to the original question on the gradient in non-Cartesian coordinates. I found an alternative way to prove the result in the framework of differential geometry without manipulating $d\mathbf{r}$ directly.
First it's important to note why gradient behaves a bit non-intuitively in non-Cartesian coordinates. Although gradient looks like the differential (total derivative) in the form of $\begin{pmatrix}\frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y}\end{pmatrix}$, it's not differential. At point $p$, it is defined as
$$
g(\nabla f, \mathbf{v}) = df(\mathbf{v})
$$
where $g$ is the inner product on the tangent space $T_p$, and $df$ is the one form (or total derivative, or differential, there're so many names...), which eats a vector and results in a real number as usual. The important thing to note is the involvement of the inner product. Therefore it is reasonable that the gradient is dependent on the inner product. Thus if the inner product takes the simplest form in Cartesian coordinates, and more complicated form in other coordinate systems, then it makes sense to have gradient with more complicated form in non-Cartesian coordinate system.
The above equation means $g$ can also be viewed as a function of $T_p \to T_p^*$ that eats a tangent vector $\nabla f$ and results in a one form $df$. Now assume the coordinate system $q_i$ has orthogonal basis at point $p$, then $g(\frac{\partial}{\partial q_i}, \frac{\partial}{\partial q_j}) = 0$ and $g(\frac{\partial}{\partial q_i}, \frac{\partial}{\partial q_i}) = \lvert \frac{\partial}{\partial q_i} \rvert^2 = h_i^2$. Given a tangent vector $\sum a_i\frac{\partial}{\partial q_i}$
$$
\begin{aligned}
g(\sum a_i\frac{\partial}{\partial q_i}) &= \sum h_i^2a_idq_i
\end{aligned}
$$
which is a one form.
Another thing is that we want express $\nabla f$ to using orthonormal vector basis, rather than just orthogonal basis. The orthonormal basis is simply $\frac{1}{h_i} \frac{\partial}{\partial q_i}$, put it into previous equation we get
$$
\begin{aligned}
g(\sum \frac{1}{h_i}\frac{\partial}{\partial q_i}) &= \sum h_idq_i
\end{aligned}
$$
Now since $df_p = \sum \frac{\partial f}{\partial q_i}dq_i$, to find the gradient we need to find $a_i$ such that
$$
\begin{aligned}
g(\sum a_i\frac{1}{h_i}\frac{\partial}{\partial q_i}) &= \sum \frac{\partial f}{\partial q_i}dq_i
\end{aligned}
$$
Therefore $a_i = \frac{1}{h_i}\frac{\partial f}{\partial q_i}$, hence we get the result
$$
\begin{aligned}
\nabla f = \sum \frac{1}{h_i} \frac{\partial f}{\partial q_i} \mathbf{\hat{e}}_i
\end{aligned}
$$
Now to calculate gradient in an orthogonal coordinates, we need to calculate $h_i$. Here we relate the coordinate basis to Cartesian basis, where $\lvert \frac{\partial}{\partial x_i} \rvert = 1$.
$$
\begin{aligned}
h_i
&= \lvert \frac{\partial}{\partial q_i} \rvert \\
&= \lvert \sum_j \frac{\partial x_j}{\partial q_i} \frac{\partial}{\partial x_i} \rvert
\end{aligned}
$$
Consider polar coordinate system, where $x = r\cos \theta$ and $y = r\sin \theta$. We get
$$
\begin{aligned}
\frac{\partial x}{\partial r} &= \cos \theta \\
\frac{\partial y}{\partial r} &= \sin \theta
\end{aligned}
$$
Therefore $h_r = \lvert \cos \theta \frac{\partial}{\partial x} + \sin \theta \frac{\partial}{\partial y} \rvert = 1$.
$$
\begin{aligned}
\frac{\partial x}{\partial \theta} &= -r\sin \theta \\
\frac{\partial y}{\partial \theta} &= r\cos \theta
\end{aligned}
$$
Therefore $h_\theta = \lvert r\cos \theta \frac{\partial}{\partial x} - r\sin \theta \frac{\partial}{\partial y} \rvert = r$.
Now we get gradient in polar coordinates
$$
\nabla f = \frac{\partial f}{\partial r} \mathbf{\hat{e}}_r + \frac{1}{r}\frac{\partial f}{\partial \theta} \mathbf{\hat{e}}_\theta
$$
Some good references that helped me:
Definition of the gradient for non-Cartesian coordinates
https://www.math.arizona.edu/~faris/methodsweb/manifold.pdf