It's natural to have some confusion about these things. Differential geometry and smooth manifold theory (and much of the rest of math) are full of shortcuts, or "identifications," that make our lives easier once we understand what they mean, but that can make the uninitiated's life needlessly difficult when it comes time to write proofs and ask whether we really understand the shortcuts we take.
For any smooth map $f\colon M\to \mathbb R$ there is the global differential map, $df\colon TM\to T\mathbb R$ defined by
$$
df(p,v) = (f(p),df_p(v)),
$$
and the vector $df_p(v)$ acts on smooth functions $h$ on $\mathbb R$ by $df_p(v)(h) = v(h\circ f)$. For fixed $p\in M$, the map $df_p\colon T_pM\to T_{f(p)}\mathbb R$ is the differential of $\pmb f$ at $\pmb p$. For any point $q\in\mathbb R$, there is a canonical vector space isomorphism $L_q\colon \mathbb R\cong T_{q}\mathbb R$ defined by
$$
L_q(v) = v\frac{d}{dt}\bigg|_q,
$$
i.e., sending the number $v$ to the directional derivative with respect to the "vector" $v$ (which is of course merely the number $v$ times the usual derivative operator for smooth functions on $\mathbb R$). Since $L_{f(p)}$ maps $\mathbb R$ into $T_{f(p)}\mathbb R$, it is its inverse that we compose with $df_p$ to get a linear map
$$
\widetilde{df_p} \equiv L_{f(p)}^{-1}\circ df_p\colon T_pM\to \mathbb R.
$$
Local coordinates $(x^1,\dots,x^n)$ near $p$ give a basis $\partial_{x^1}|_p,\dots,\partial_{x^n}|_p$ for $T_pM$, with respect to which the linear map $\widetilde{df_p}$ is simply the row vector
$$
\begin{bmatrix} \displaystyle\frac{\partial f}{\partial x^1}(p) & \dotsb & \displaystyle\frac{\partial f}{\partial x^n}(p) \end{bmatrix}.
$$
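To see where these entries come from, here is a short check using only the definitions above; writing $t$ for the standard coordinate (the identity function) on $\mathbb R$ is my own notation. Every vector in $T_{f(p)}\mathbb R$ has the form $c\,\frac{d}{dt}\big|_{f(p)}$, and applying it to $t$ recovers the number $c$. So the number that corresponds to $df_p(v)$ is
$$
df_p(v)(t) = v(t\circ f) = v(f),
$$
and taking $v = \partial_{x^i}|_p$ gives
$$
\widetilde{df_p}\big(\partial_{x^i}|_p\big) = \frac{\partial f}{\partial x^i}(p),
$$
which is the $i$-th entry of the row vector.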
For $f\colon M\to\mathbb R$, we also have a well-defined covector field $df\colon M\to T^*M$. In local coordinates $(x^1,\dots,x^n)$ near $p$, we can express the covector field $df$ in terms of the local coframe $dx^1,\dots,dx^n$ (dual frame of $\partial_{x^1},\dots,\partial_{x^n}$) as
$$
df = \sum_i\frac{\partial f}{\partial x^i}\,dx^i.
$$
At each point $p$, we thus have a covector $df_p\colon T_pM\to \mathbb R$ expressed in terms of the basis $dx^1|_p,\dots,dx^n|_p$ by
$$
df_p = \sum_i\frac{\partial f}{\partial x^i}(p)\,dx^i|_p,
$$
so with respect to the basis $dx^1|_p,\dots,dx^n|_p$, $df_p\in T_p^*M$ can be expressed as the row vector
$$
\begin{bmatrix} \displaystyle\frac{\partial f}{\partial x^1}(p) & \dotsb & \displaystyle\frac{\partial f}{\partial x^n}(p) \end{bmatrix}.
$$
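A small worked example may help here. The function, point, and vector below are my own choices for illustration. Take $M=\mathbb R^2$ with coordinates $(x,y)$, $f(x,y)=x^2y$, $p=(1,2)$ (so $f(p)=2$), and $v = 3\,\partial_x|_p - \partial_y|_p$. As a covector,
$$
df_p = 4\,dx|_p + dy|_p, \qquad df_p(v) = 4\cdot 3 + 1\cdot(-1) = 11,
$$
while as a differential, for any smooth $h$ on $\mathbb R$,
$$
df_p(v)(h) = v(h\circ f) = 3\cdot 4\,h'(2) - 1\cdot h'(2) = 11\,h'(2),
\qquad\text{i.e.}\qquad df_p(v) = 11\,\frac{d}{dt}\bigg|_{2}.
$$
The number $11$ and the tangent vector $11\,\frac{d}{dt}\big|_2$ correspond to each other under $L_2$.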
So really, $df_p$ the differential and $df_p$ the covector are the same object up to the canonical isomorphism $L_{f(p)}$. We remind ourselves of this isomorphism $L$ maybe the first few times we identify the differential $df_p$ with the covector $df_p$, and then drop it entirely once we get used to it. With more experience, one comes to appreciate the "intent of the law" rather than strictly follow the "letter of the law," and the interpretations we make are ultimately dictated by the purposes we have in mind.
That said, if one wants to define $\mathrm{grad} f$ "right," without making identifications, then I'd say one needs to be comfortable with covector fields and with the musical isomorphism $(\cdot)^\sharp\colon T^*M\cong TM$ that the metric $g$ gives us, so that we can do things properly and say simply and without ambiguity that $\mathrm{grad} f = (df)^\sharp$.
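For concreteness, here is a sketch of what this looks like in local coordinates, under the usual conventions. The notation $g_{ij} = g(\partial_{x^i},\partial_{x^j})$, with $(g^{ij})$ the inverse matrix, is mine and not used above. Raising the index of $df$ gives
$$
\mathrm{grad} f = (df)^\sharp = \sum_{i,j} g^{ij}\,\frac{\partial f}{\partial x^i}\,\partial_{x^j}.
$$
For example, in polar coordinates $(r,\theta)$ on $\mathbb R^2\setminus\{0\}$ with the Euclidean metric $g = dr^2 + r^2\,d\theta^2$,
$$
df = \frac{\partial f}{\partial r}\,dr + \frac{\partial f}{\partial\theta}\,d\theta,
\qquad
\mathrm{grad} f = \frac{\partial f}{\partial r}\,\partial_r + \frac{1}{r^2}\frac{\partial f}{\partial\theta}\,\partial_\theta,
$$
so the components of $df$ and of $\mathrm{grad} f$ differ here, because $g_{ij}\neq\delta_{ij}$ in these coordinates.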