I think to understand Leibniz notation it is best to get into a “Leibniz mindset”, because in his time the idea of a function was different from today. As far as I know, he would talk about $y$ being a function of $x$, meaning that the value of $x$ determines the value of $y$. In modern parlance: There exists a (modern) function $f : \mathbb{R} \to \mathbb{R}$ such that $y = f(x)$.
But note that there is no $x$ on the left-hand side. The value of $y$ really depends on the “current” value of $x$. This is different from modern functions where writing $f(x) = x^2$ and $f(z) = z^2$ really defines the same function $f$. As such, writing $y(x_0)$ does not really make sense, as $y$ might be a function of different variables in different ways, and to express the same thing one should specify which variable one fixes and write $y|_{x=x_0}$ (i.e. mention $x$ explicitly). Writing $y|_{x_0}$ or even $y(x_0)$ is a common shorthand, though, and often does not lead to confusion.
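To make the ambiguity of “$y(3)$” concrete, here is a small sketch. The dependencies $u = 2x$ and $y = u^2$, and the names `u_from_x`, `y_from_u`, `y_from_x`, are assumptions chosen purely for illustration; they are not part of the question.

```python
# Hypothetical dependencies, chosen only for illustration:
# u = 2x and y = u^2, so y is also a function of x, namely y = 4x^2.
def u_from_x(x):
    return 2 * x

def y_from_u(u):
    return u ** 2

def y_from_x(x):
    # y as a function of x is the composition of the two dependencies.
    return y_from_u(u_from_x(x))

y_at_u_3 = y_from_u(3)  # y restricted to u = 3
y_at_x_3 = y_from_x(3)  # y restricted to x = 3
print(y_at_u_3, y_at_x_3)  # prints: 9 36
```

The single variable $y$ corresponds to two different modern functions here, which is why naming the variable in $y|_{x=3}$ or $y|_{u=3}$ carries real information.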
In this context, Leibniz notation makes sense: An (infinitesimal) change of $x$ gives, by the implicit dependence of $y$ on $x$, an (infinitesimal) change of $y$; their quotient is the derivative $\frac{dy}{dx}$. Now as you know, the value of this also depends on the value of $x$. In other words, $\frac{dy}{dx}$ is again a “function of $x$”, namely $\frac{dy}{dx} = f'(x)$ (if $y = f(x)$).
If we want the value of this at a specific point, we can write $\frac{dy}{dx}|_{x = x_0}$ or $\frac{dy}{dx}|_{x_0}$ for $f'(x_0)$.
For Leibniz' chain rule, we are in the following situation: The variable $u$ is a function of $x$, say $u = f(x)$, and the variable $y$ is a function of $u$, say $y=g(u)$. Substituting, we see that $y = g(f(x)) = (g \circ f)(x)$, i.e. $y$ is also a function of $x$. We can therefore try to compute $\frac{dy}{dx}$. Using the chain rule you know for the $'$-notation, we see:
$$
\frac{dy}{dx} = (g \circ f)'(x) = g'(f(x)) f'(x) = g'(u) f'(x) = \frac{dy}{du} \frac{du}{dx}.
$$
This is the usual form of this chain rule. Note that the first factor $\frac{dy}{du}$ is a function of $u$ but also a function of $x$ because $u$ is a function of $x$.
If we want to evaluate at a specific point $x_0$, we run into the problem mentioned above that $\frac{dy}{du}$ is a function of both $x$ and $u$ in different ways. Now $x_0$ looks like it should be a value for $x$, but in your question you use $a$, where this is less clear. And $\frac{dy}{du}$ depends more obviously on $u$; so to be explicit, we write down the transformation from $x$ to $u$, i.e.
$$
\frac{dy}{dx}|_{x=x_0} = \frac{dy}{du}|_{u=f(x_0)} \frac{du}{dx}|_{x=x_0}.
$$
Using the shorthands above,
$$
\frac{dy}{dx}|_{x_0} = \frac{dy}{du}|_{u(x_0)} \frac{du}{dx}|_{x_0},
$$
where we also conflated the variable $u$ with the function $f$, writing $u(x_0)$ for $f(x_0)$.
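As a numerical sanity check of the pointwise formula, here is a minimal sketch. The choices $f(x) = \sin x$, $g(u) = u^3$, the point $x_0 = 0.7$, and the helper `derivative` (a plain central difference) are all assumptions made for illustration.

```python
import math

def derivative(func, point, h=1e-6):
    """Central-difference approximation of func'(point)."""
    return (func(point + h) - func(point - h)) / (2 * h)

# Hypothetical example dependencies: u = f(x) and y = g(u).
def f(x):
    return math.sin(x)

def g(u):
    return u ** 3

x0 = 0.7
u0 = f(x0)  # the value of u that corresponds to x = x0

# Left-hand side: dy/dx at x = x0, with y viewed as a function of x.
lhs = derivative(lambda x: g(f(x)), x0)

# Right-hand side: dy/du at u = f(x0), times du/dx at x = x0.
rhs = derivative(g, u0) * derivative(f, x0)

# Common mistake: evaluating dy/du at u = x0 instead of u = f(x0).
wrong = derivative(g, x0) * derivative(f, x0)

print(lhs, rhs, wrong)
```

The first two numbers agree up to discretisation error, while the third differs noticeably. That gap is exactly what the subscript $u = f(x_0)$ guards against.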
Why do we use this notation? Aside from convention, it has two advantages:
- Defining a dependence is often easier, i.e. we can write down and differentiate the term $x^3 + \exp(x)$ immediately instead of having to define a function $f$ with $f(x) = x^3 + \exp(x)$ first.
- Once you have functions of several variables, it is nice to be able to refer to them by name in the derivative notation (instead of having to use numerical indices). I think this is why physicists often still use this notion of “function”.
Last, I want to mention that if you go on to study differential geometry, you will learn how to do all of this in a way that kind of gives you the best of both worlds: Normal functions (which I think are conceptually much clearer than these “functions of $x$”) combined with the ability to refer to names instead of indices.