$
\newcommand\PD[2]{\frac{\partial#1}{\partial#2}}
\newcommand\tPD[2]{\partial#1/\partial#2}
\newcommand\dd{\mathrm d}
\newcommand\R{\mathbb R}
\newcommand\diff\underline
\newcommand\adj\overline
\DeclareMathOperator\Tr{Tr}
\newcommand\Hom{\mathrm{Hom}}
$The following is a slight adaptation of "What Is $\nabla$ Anyway?" from my answer here.
For simplicity I will only consider real vector spaces and the standard inner product,
but these constraints are not necessary.
Coordinate Expression
We could fiddle around with coordinates.
If $\{x^i\}_{i=1}^m$ are coordinates of $\R^m$,
then there is a function $p(x^1, x^2, \dotsc, x^m)$ taking coordinates to points.
This gives us a basis $\{e_i\}_{i=1}^m$ of $\R^m$ via $e_i = \tPD p{x^i}$,
whose reciprocal basis $\{e^i\}_{i=1}^m$ is the unique basis such that $e^i\cdot e_j = \delta^i_j$.
Now let $V$ be a vector space.
Then for any $L_x : \R^m \to V$ which is linear for each $x \in \R^m$, i.e. $L_x(v)$ is linear in $v$, we define
$$
L_x(\nabla) = \sum_{i=1}^m \PD{L_{p(x^1,\dotsc,x^m)}(e^i)}{x^i}.
$$
This definition comes from informally applying the linearity of $L$ to
$$
L_x(\nabla) = L_{p(x^1,\dotsc,x^m)}\left(\sum_{i=1}^me^i\PD{}{x^i}\right).
$$
We would then go through the tedious task of confirming that this definition is independent of the coordinates chosen.
But I don't like coordinates.
Without Coordinates
First, for any function $f : \R^m \to V$ into some normed vector space $V$,
the total differential $\diff f(x; {-}) : \R^m \to V$ of $f$ at $x \in \R^m$
is the linear function which best approximates $f$ at $x$.
This captures all notions of the variation of $f$,
and for any finite-dimensional $V$ is uniquely defined.
Consider again linear functions $L_x : \R^m \to V$;
if $\Hom(\R^m, V)$ is the set of linear functions $\R^m \to V$,
then we can think of $L$ as a function $\R^m \to \Hom(\R^m, V)$
and so $\diff L(x; v) \in \Hom(\R^m, V)$ for any $v \in \R^m$.
If $V$ is finite dimensional, then so is $\Hom(\R^m, V)$ and the differential is uniquely defined;
we will be assuming this is the case.
We then define the derivative of $L$ as
$$
L_x(\nabla) = \Tr_{v\cdot w}\diff L(x; v)(w)
\tag{Deriv}
$$
where $\Tr_{v\cdot w}$ is indicating a metric contraction
of the tensor $\diff L(x; {-})({-}) : \R^m\times\R^m \to V$.
Note that $L_x(\nabla) \in V$,
so in this sense $L_x$ takes $\nabla$ to $L_x(\nabla)$
just like it takes $v \in \R^m$ to $L_x(v)$.
It is easy to verify that
$$
(aL_x + bM_x)(\nabla) = aL_x(\nabla) + bM_x(\nabla),\quad a, b \in \R.
$$
The way this works, then, is by defining an appropriate $L$ (the first three cases below are verified numerically in the sketch after this list):
- If $f : \R^m \to \R$, then the gradient of $f$ is the derivative of $L_x(v) = vf(x)$.
- If $F : \R^m \to \R^m$, then the divergence of $F$ is the derivative of $L_x(v) = v\cdot F(x)$.
- If $F : \R^3 \to \R^3$, then the curl of $F$ is the derivative of $L_x(v) = v\times F(x)$.
- We can also have some things which are not expressible in standard notation.
If $L_x(v) = F(x)\cdot(v\times G(x))$,
then $L_x(\nabla)$ is not $F(x)\cdot(\nabla\times G(x))$;
$L_x(\nabla)$ is differentiating both $F(x)$ and $G(x)$.
- It is easy to show that an expression like $L_x(\nabla, \nabla)$ for $L_x$ bilinear
is independent of the order in which each $\nabla$ is applied
(so long as partial derivatives commute,
or equivalently $\diff{\diff L}(x; {-}, {-})$ is symmetric).
This lets us define e.g. the Laplacian of $f(x)$ via $L_x(v, w) = (v\cdot w)f(x)$;
notice that $f$ can be valued in any vector space.
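As a sanity check, here is a minimal numerical sketch of (Deriv) in Python with NumPy; the helper `nabla`, the test fields, and the sample point are my own illustrative choices. It evaluates the coordinate formula $\sum_i \tPD{L_x(e_i)}{x^i}$ in the standard basis (which is its own reciprocal) and recovers the gradient, divergence, and curl from the one definition.

```python
import numpy as np

def nabla(L, x, h=1e-6):
    """Evaluate (Deriv) in the standard basis (its own reciprocal):
    L_x(nabla) = sum_i d/dx^i [ L_x(e_i) ], by central differences."""
    return sum((L(x + h * e, e) - L(x - h * e, e)) / (2 * h)
               for e in np.eye(len(x)))

f = lambda x: x[0]**2 + x[1] * x[2]                          # scalar field
F = lambda x: np.array([x[0] * x[1], x[1] * x[2], x[0]**2])  # vector field
x = np.array([0.3, -1.2, 0.7])

grad = nabla(lambda x, v: v * f(x), x)           # L_x(v) = v f(x)
div  = nabla(lambda x, v: v @ F(x), x)           # L_x(v) = v . F(x)
curl = nabla(lambda x, v: np.cross(v, F(x)), x)  # L_x(v) = v x F(x)

assert np.allclose(grad, [2 * x[0], x[2], x[1]], atol=1e-5)
assert np.isclose(div, x[1] + x[2], atol=1e-5)
assert np.allclose(curl, [-x[1], -2 * x[0], -x[0]], atol=1e-5)
```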
Motivation
The definition (Deriv) should be unsatisfying;
we basically just wrote the coordinate definition in coordinate-free language.
How can we justify it?
We make one assumption for $F : \R^m \to \R^m$:
$$
\nabla\cdot F(x) = \Tr_v\diff F(x; v),
$$
i.e. that $L_x(\nabla)$ for $L_x(v) = v\cdot F(x)$ is the trace of the differential of $F$.
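This assumption is the familiar coordinate fact that the divergence is the trace of the Jacobian; a quick finite-difference check (the test field is again an illustrative choice):

```python
import numpy as np

F = lambda x: np.array([x[0] * x[1], x[1] * x[2], x[0]**2])
x, h = np.array([0.3, -1.2, 0.7]), 1e-6

# Jacobian of F at x: column i is the directional differential dF(x; e_i).
J = np.column_stack([(F(x + h * e) - F(x - h * e)) / (2 * h)
                     for e in np.eye(3)])
# Tr dF(x; -) agrees with the divergence of F, here x[1] + x[2].
assert np.isclose(np.trace(J), x[1] + x[2], atol=1e-5)
```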
Now, the inner product induces an isomorphism $\flat : \R^m \to (\R^m)^*$
via $v^\flat(w) = v\cdot w$, where $(\R^m)^* = \Hom(\R^m, \R)$ is the dual space of $\R^m$.
We denote $\sharp = \flat^{-1}$.
If $L_x : \R^m \to \R$, then $x \mapsto L_x^\sharp$ is a vector field to which the assumption applies, and since $\sharp$ is linear it commutes with the differential; so we argue
$$
L_x(\nabla) = L_x^\sharp\cdot\nabla = \Tr_u\diff L(x; u)^\sharp = \Tr_{u\cdot v}\diff L(x; u)(v).
$$
Thus when $L_x : \R^m \to V$ and $\alpha \in V^*$ we argue
$$
\alpha(L_x(\nabla))
= (\alpha\circ L_x)(\nabla)
= \Tr_{u\cdot v}\diff{\alpha\circ L_x}(x; u)(v)
= \Tr_{u\cdot v}\alpha\bigl(\diff L(x; u)(v)\bigr).
$$
Contracting the tensor $(w, \alpha) \mapsto w\alpha(L_x(\nabla))$ for $w \in V$ finally gives
$$\begin{aligned}
L_x(\nabla)
&= \Tr_{\alpha(w)}w\Tr_{u\cdot v}\alpha(\diff L(x; u)(v))
\\
&= \Tr_{u\cdot v}\Tr_{\alpha(w)}w\alpha(\diff L(x; u)(v))
\\
&= \Tr_{u\cdot v}\diff L(x; u)(v)
\end{aligned}$$
as desired.
The Chain Rule
Let $F : \R^m \to \R^n$, and let $L_y : \R^m \to V$ be linear for each $y \in \R^n$.
The chain rule takes the following form:
$$
L_{F(x)}(\nabla)
= L_{F(x)}\Bigl(\adj F(x; \nabla_F)\Bigr).
$$
$\adj F(x; {-})$ is the adjoint of $\diff F(x; {-})$ under the inner products on $\R^m$ and $\R^n$.
To be clear, what we mean by the RHS
is the derivative of $y \mapsto L_y(\adj F(x; {-}))$ evaluated at $y = F(x)$.
This chain rule can be expressed by the mnemonic
$$
\nabla_x = \adj F(x; \nabla_F).
$$
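Here is a numerical sketch of the chain rule in the same style as before; the maps $F$ and $G$ are illustrative, and I take $L_y(v) = v\cdot G(y)$, so the left-hand side is the divergence of $G\circ F$ and the adjoint is the transpose of the Jacobian matrix of $F$.

```python
import numpy as np

def nabla(L, x, h=1e-6):
    """(Deriv) by central differences: sum_i d/dx^i [ L_x(e_i) ]."""
    return sum((L(x + h * e, e) - L(x - h * e, e)) / (2 * h)
               for e in np.eye(len(x)))

def jac(f, x, h=1e-6):
    """Total differential df(x; -) as a matrix; column i is df(x; e_i)."""
    return np.column_stack([(f(x + h * e) - f(x - h * e)) / (2 * h)
                            for e in np.eye(len(x))])

F = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1]**2])  # R^2 -> R^3
G = lambda y: np.array([y[0] + y[1] * y[2], y[2]**2])         # R^3 -> R^2
x0 = np.array([0.4, -0.8])

# LHS: L_{F(x)}(nabla) with L_y(v) = v . G(y), i.e. the divergence of G o F.
lhs = nabla(lambda x, v: v @ G(F(x)), x0)

# RHS: derivative of y |-> L_y(adjF(x0; -)) at y = F(x0), where the
# adjoint of dF(x0; -) is the transpose of its Jacobian matrix.
adjF = jac(F, x0).T
rhs = nabla(lambda y, w: (adjF @ w) @ G(y), F(x0))

assert np.isclose(lhs, rhs, atol=1e-4)
```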
The Subexpression Rule
Consider $\R^{2m} \cong \R^m\oplus\R^m$ and a family of linear maps $L_{x\oplus y} : \R^m \to V$ indexed by points $x\oplus y$ of $\R^m\oplus\R^m$.
Then
$$
L_{x\oplus x}(\nabla) = L_{\dot x\oplus x}(\dot\nabla) + L_{x\oplus\dot x}(\dot \nabla),
$$
where by e.g. $L_{\dot x\oplus x}(\dot\nabla)$
we mean the derivative of the family $\dot x \mapsto L_{\dot x\oplus y}$, evaluated at $\dot x = y = x$;
in other words, the undotted $x$ is held constant when differentiating.
We can summarize this "subexpression rule" as follows:
the derivative of an expression is the sum of the derivatives of its subexpressions.
We will demonstrate this rule in the next section.
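Although the demonstration proper comes in the next section, here is a direct numerical sketch (same conventions as the earlier snippets; $F$ and $g$ are illustrative): taking $L_{x\oplus y}(v) = (v\cdot F(x))\,g(y)$, the rule reproduces the product rule $\nabla\cdot(g(x)F(x)) = (\nabla\cdot F(x))g(x) + F(x)\cdot\nabla g(x)$.

```python
import numpy as np

def nabla(L, x, h=1e-6):
    """(Deriv) by central differences: sum_i d/dx^i [ L_x(e_i) ]."""
    return sum((L(x + h * e, e) - L(x - h * e, e)) / (2 * h)
               for e in np.eye(len(x)))

F = lambda x: np.array([x[0] * x[1], x[1] * x[2], x[0]**2])
g = lambda x: np.sin(x[0]) + x[1] * x[2]
x0 = np.array([0.3, -1.2, 0.7])

# L_{x (+) y}(v) = (v . F(x)) g(y); on the diagonal y = x this is div(g F).
full = nabla(lambda x, v: (v @ F(x)) * g(x), x0)

# Subexpression rule: differentiate one slot at a time, holding the other at x0.
dot_first  = nabla(lambda x, v: (v @ F(x)) * g(x0), x0)   # dotted first slot
dot_second = nabla(lambda y, v: (v @ F(x0)) * g(y), x0)   # dotted second slot
assert np.isclose(full, dot_first + dot_second, atol=1e-4)
```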
Vectorial Manipulation
It is a simple fact that if $L_x(v) = M_x(v)$ then $L_x(\nabla) = M_x(\nabla)$;
moreover, if $L_x$ and $M_x$ are multilinear
then $L_x(\nabla,\dotsc,\nabla) = M_x(\nabla,\dotsc,\nabla)$.
This means that as long as an expression involving $\nabla$ can be uniquely linearized,
the use of $\nabla$ is valid and $\nabla$ may be manipulated as a vector,
provided every instance of $x$ is differentiated.
(In this sense, the Laplacian $\nabla^2$ is not well-defined over fields of characteristic 2;
in characteristic not 2, there is a (unique) symmetric bilinear form $v\cdot w$ such that $v\cdot v = |v|^2$,
but in characteristic 2 this is not true.)
Some examples would be prudent.
Consider now a vector identity like (in $\R^3$)
$$
a\times(b\times c) = (a\cdot c)b - (a\cdot b)c.
$$
If $L(a, b, c)$ is the LHS and $M(a, b, c)$ the RHS,
then it is automatically true that e.g.
$L(\nabla, F(x), c) = M(\nabla, F(x), c)$ since $L = M$.
(To be clear, here we mean e.g. the derivative of $x \mapsto L({-}, F(x), c)$.)
Written in standard notation, this looks like
$$
\nabla\times(F(x)\times c) = (c\cdot\nabla)F(x) - (\nabla\cdot F(x))c.
$$
This shows where many people go wrong when manipulating $\nabla$ expressions;
because of the usual convention of "differentiate directly to the right",
we had to flip the first term around.
But this convention becomes unusable in a situation like $L(\nabla, F(x), G(x)) = M(\nabla, F(x), G(x))$;
this is still true since $L = M$.
But the following is false in conventional notation:
$$
\nabla\times(F(x)\times G(x)) \not= (\nabla\cdot G(x))F(x) - (\nabla\cdot F(x))G(x)
$$
since we need to differentiate both $F$ and $G$ in each expression,
not just one or the other as the conventional notation suggests.
What we might write to make this clear is
$$
\nabla\times(F(x)\times G(x)) = (\dot\nabla\cdot G(\dot x))F(\dot x) - (\dot\nabla\cdot F(\dot x))G(\dot x).
$$
To write this in conventional notation,
we have to use the subexpression rule to differentiate one function at a time:
$$\begin{aligned}
&\nabla\times(F(x)\times G(x))
\\
&\quad= \bigl(\nabla\cdot G(x)\bigr)F(x) + \bigl(G(x)\cdot\nabla\bigr)F(x)
- \bigl(\nabla\cdot F(x)\bigr)G(x) - \bigl(F(x)\cdot\nabla\bigr)G(x).
\end{aligned}$$
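Checking the four-term expansion numerically (illustrative fields as before; $(G\cdot\nabla)F$ is computed as the Jacobian of $F$ applied to $G(x)$):

```python
import numpy as np

def nabla(L, x, h=1e-6):
    """(Deriv) by central differences: sum_i d/dx^i [ L_x(e_i) ]."""
    return sum((L(x + h * e, e) - L(x - h * e, e)) / (2 * h)
               for e in np.eye(len(x)))

def jac(f, x, h=1e-6):
    """Total differential df(x; -) as a matrix; column i is df(x; e_i)."""
    return np.column_stack([(f(x + h * e) - f(x - h * e)) / (2 * h)
                            for e in np.eye(len(x))])

F = lambda x: np.array([x[0] * x[1], x[1] * x[2], x[0]**2])
G = lambda x: np.array([np.sin(x[1]), x[0] * x[2], x[2]**2])
x0 = np.array([0.3, -1.2, 0.7])

# LHS: curl of F x G, via L_x(v) = v x (F(x) x G(x)).
lhs = nabla(lambda x, v: np.cross(v, np.cross(F(x), G(x))), x0)

# RHS: the four-term expansion; (G.nabla)F is the Jacobian of F applied to G.
JF, JG = jac(F, x0), jac(G, x0)
rhs = (np.trace(JG) * F(x0) + JF @ G(x0)
       - np.trace(JF) * G(x0) - JG @ F(x0))
assert np.allclose(lhs, rhs, atol=1e-4)
```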
Now consider the identity
$$
|v|^2|w|^2 = (v\cdot w)^2 + |v\times w|^2.
$$
Can we say that
$$
\nabla^2|F(x)|^2 = (\nabla\cdot F(x))^2 + |\nabla\times F(x)|^2?
$$
No: we can linearize the above as
$$
(u\cdot v)|w|^2 = (u\cdot w)(v\cdot w) + (u\times w)\cdot(v\times w)
$$
and take its second derivative, but each $\nabla$ must differentiate each copy of $F(x)$.
That is to say that
$$
\nabla^2|F(x)|^2
= (\dot\nabla\cdot F(\hat{\dot x}))(\hat\nabla\cdot F(\hat{\dot x}))
+ (\dot\nabla\times F(\hat{\dot x}))\cdot(\hat\nabla\times F(\hat{\dot x})).
$$
Both of these terms then need to be expanded with the subexpression rule into a total of 8 terms,
only two of which will be $(\nabla\cdot F(x))^2$ and $|\nabla\times F(x)|^2$.
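A numerical sketch makes the failure concrete (illustrative field as before): the Laplacian of $|F(x)|^2$, computed by second central differences, does not equal $(\nabla\cdot F(x))^2 + |\nabla\times F(x)|^2$.

```python
import numpy as np

def jac(f, x, h=1e-6):
    """Total differential df(x; -) as a matrix; column i is df(x; e_i)."""
    return np.column_stack([(f(x + h * e) - f(x - h * e)) / (2 * h)
                            for e in np.eye(len(x))])

def laplacian(f, x, h=1e-4):
    """sum_i d^2 f / (dx^i)^2 by second central differences."""
    return sum((f(x + h * e) - 2 * f(x) + f(x - h * e)) / h**2
               for e in np.eye(len(x)))

F = lambda x: np.array([x[0] * x[1], x[1] * x[2], x[0]**2])
x0 = np.array([0.3, -1.2, 0.7])

lhs = laplacian(lambda x: F(x) @ F(x), x0)    # nabla^2 |F(x)|^2, about 8.0 here

J = jac(F, x0)                                # J[i, j] = dF_i/dx^j
div = np.trace(J)
curl = np.array([J[2, 1] - J[1, 2], J[0, 2] - J[2, 0], J[1, 0] - J[0, 1]])
rhs = div**2 + curl @ curl                    # about 2.14 here
assert not np.isclose(lhs, rhs, atol=1e-2)    # the naive identity fails
```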
Integral Representation
Again using the divergence as a starting point, it is well known that
$$
\nabla\cdot F(x) = \lim_{R_x\to 0}\frac1{|R_x|}\oint_{\partial R_x}\hat n(y)\cdot F(y)\,\dd S
$$
where $R_x$ is a region with non-zero volume containing $x$, $\partial R_x$ is its boundary, $\hat n$ is the outward-pointing unit normal to $\partial R_x$, $\dd S$ is the surface area measure on $\partial R_x$, and $y$ is the vector variable of integration. The limit is taken such that the regions $R_x$ decrease in diameter around $x$.
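Here is a Monte Carlo sketch of this representation; the radius, sample count, and test field are arbitrary choices, and the flux is sampled in antithetic pairs $\pm\hat n$ to tame the variance of its $O(1)$ part.

```python
import numpy as np

F = lambda x: np.array([x[0] * x[1], x[1] * x[2], x[0]**2])
x0, r = np.array([0.3, -1.2, 0.7]), 1e-3

# Uniform unit normals on the sphere bd(R_x) of radius r around x0.
rng = np.random.default_rng(0)
n = rng.normal(size=(100_000, 3))
n /= np.linalg.norm(n, axis=1, keepdims=True)

# Surface average of n.F over the sphere, sampled in antithetic pairs +-n.
flux = np.array([nh @ (F(x0 + r * nh) - F(x0 - r * nh)) / 2 for nh in n])

# (1/|R_x|) oint n.F dS = (|bd R_x|/|R_x|) * (surface average) = (3/r) * avg.
div_estimate = 3 / r * flux.mean()
assert abs(div_estimate - (x0[1] + x0[2])) < 0.05    # div F = x1 + x2
```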
Following the Motivation section we get
$$\begin{aligned}
\alpha(L_x(\nabla))
&= (\alpha\circ L_x)(\nabla)
= \nabla\cdot(\alpha\circ L_x)^\sharp
\\
&= \lim_{R_x\to 0}\frac1{|R_x|}\oint_{\partial R_x}\hat n(y)\cdot(\alpha\circ L_y)^\sharp\,\dd S
\\
&= \lim_{R_x\to 0}\frac1{|R_x|}\oint_{\partial R_x}\alpha(L_y(\hat n(y)))\,\dd S
\\
&= \alpha\left(
\lim_{R_x\to 0}\frac1{|R_x|}\oint_{\partial R_x}L_y(\hat n(y))\,\dd S
\right)
\end{aligned}$$
for $L_x : \R^m \to V$ and any $\alpha \in V^*$, hence
$$
L_x(\nabla) = \lim_{R_x\to 0}\frac1{|R_x|}\oint_{\partial R_x}L_y(\hat n(y))\,\dd S
$$
where $y$ is the variable of integration.
This shows that $L_x(\nabla)$ can be interpreted as an average
over an infinitesimal neighborhood of $x$.
Combining this with the algebraic expression for the derivative
gives a geometric interpretation of the metric contraction of a tensor.
For any bilinear $M(v, w)$ that has an "antiderivative" $L_x(v)$,
i.e. $\diff L(x; v)(w) = M(v, w)$ for some $x$, we may write
$$
\Tr_{v\cdot w} M(v, w) = \lim_{R_x\to0}\frac1{|R_x|}\oint_{\partial R_x}L_y(\hat n(y))\,\dd S.
$$
Such an antiderivative is easy to construct in flat space: define $L_x(v) = M(x, v)$.
Following this train of thought to its conclusion will give
$$
\Tr_{v\cdot w} M(v, w)
= \frac1{|V_m|}\oint_{\partial V_m}M(y, y)\,\dd S
= \frac m{|\partial V_m|}\oint_{\partial V_m}M(y, y)\,\dd S
$$
where $V_m \subset \R^m$ is the unit ball centered at the origin,
$|V_m|$ its volume, and $|\partial V_m|$ its surface area.
This last expression is exactly $m$ times the average of $y \mapsto M(y, y)$ over the unit sphere.
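For instance, here is a Monte Carlo sketch with an arbitrary bilinear $M(v, w) = v\cdot Aw$, whose metric contraction is $\Tr A$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))          # M(v, w) = v . (A w), so Tr_{v.w} M = Tr A

# Uniform samples y on the unit sphere in R^3.
y = rng.normal(size=(200_000, 3))
y /= np.linalg.norm(y, axis=1, keepdims=True)

# m times the sphere average of M(y, y) recovers the trace (m = 3 here).
sphere_avg = np.einsum('ki,ij,kj->k', y, A, y).mean()
assert np.isclose(3 * sphere_avg, np.trace(A), atol=0.02)
```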