In summary, I think the answer is that we simply do not need the underlying distribution to make use of the sample moments as conventionally defined. One might argue that the sample average is too naive--it assigns an equal weight of $1/n$ to each observation! Yes, it does seem a little naive, and yet the LLN guarantees it will converge under a pretty wide range of conditions and settings. So one might ask next: okay, but can we speed up this convergence and get a better estimate using knowledge of $f$? Perhaps. Indeed, we could try to estimate the integral
$$\int xf(x)dx$$
using the trapezoid rule. But this requires an equally spaced grid, which our observations do not necessarily fall on, so what if we try a weighted average like:
$$\sum_{i=1}^n x_i f(x_i)?$$
Well, this may or may not improve the quality, but it is a big ask--we are usually not afforded the privilege of knowing $f$ exactly. So we are left with dropping it anyway and asking about weighted averages like
$$Y = \sum_{i=1}^n w_i X_i,$$
where $w_1+\dotsc +w_n=1$ and $w_i\geq0$.
This is certainly closer to the theoretical definition, but once again we have to start asking questions like: which weights should we use, and why? This can probably yield good improvements, especially in specific industries, but ultimately the standard sample moments are simpler and require less subjective input.
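To make this concrete, here is a minimal NumPy sketch (the exponential data and the random weights are arbitrary choices of mine, purely for illustration) comparing the equal-weight sample mean with a generic weighted average satisfying $w_i\geq 0$ and $\sum_i w_i=1$:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, purely illustrative

# A sample standing in for data from an unknown distribution
# (here exponential with true mean 2, my own arbitrary choice).
x = rng.exponential(scale=2.0, size=10_000)

# Plain sample mean: equal weights 1/n.
plain_mean = x.mean()

# A generic weighted average with nonnegative weights summing to 1,
# chosen at random and independently of the data.
w = rng.uniform(size=x.size)
w /= w.sum()
weighted_mean = np.dot(w, x)

print(plain_mean, weighted_mean)  # both land near the true mean 2
```

For i.i.d. data, any such weighting chosen independently of the observations is still unbiased for the mean, but its variance $\sigma^2\sum_i w_i^2$ is minimized exactly at the equal weights $w_i=1/n$, which is another point in favor of the standard definition.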
Here are some perspectives that might help you get used to the fact that sample moments are defined this way. These should not be taken too literally; they are intended to be broad, and perhaps even a little stereotypical.
In probability theory, a typical problem is to be given a probability distribution and asked about the chance of some event, or about some "statistic" (like the mean, median, mode, maximum, etc.). So if we do not have a distribution, there is not much we can say, since we cannot compute probabilities, expectations, and so on--that is, if we also do not have data...
...but in statistics, we are given observations $x_1,\dotsc, x_n$ and want to ask things like: which distribution is most likely to have produced this data? If we have a guess about a family $\{f(x;\theta):\theta\in \Theta\}$ of distributions, then we can fit $f$ using methods like MLE to find the "best" estimate of $\theta$, as in the sketch below.
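For concreteness, here is a minimal sketch of this parametric route (the Gaussian family and the simulated data are my own arbitrary choices; the closed-form Gaussian MLE formulas are standard):

```python
import numpy as np

rng = np.random.default_rng(1)  # illustrative seed

# Hypothetical observations, together with a guessed family: {N(mu, sigma^2)}.
x = rng.normal(loc=1.0, scale=2.0, size=1_000)

# Closed-form MLE for the Gaussian family.
mu_hat = x.mean()                                # note: this IS the sample mean
sigma_hat = np.sqrt(((x - mu_hat) ** 2).mean())  # 1/n, not 1/(n-1)

print(mu_hat, sigma_hat)  # close to the true parameters (1.0, 2.0)
```

Notice that even here the MLE for $\mu$ is just the sample mean.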
But what if we lack such a guess? Well, luckily we can get a lot of information about a sample without knowing anything about its distribution $f(x;\theta)$. For example, suppose you run a trading desk or a betting team. You want to evaluate the performance of your traders, so you look at their winnings and losses over a long time period. You compute the means and see that one trader has a very large positive mean and another a very large negative mean. Since, by the SLLN, the RV
$$\bar{X}_n := \frac{1}{n}\sum_{i=1}^n X_i \to \mathbb{E}(X) \quad \text{a.s.},$$
we can be confident that these sample estimates are close to the true first moments, provided they exist and provided we have a large enough data set. Note the perhaps pedantic, but important, distinction between the RVs $X_1,\dotsc, X_n$ and $\bar{X}_n$ and the observed variates $x_1,\dotsc, x_n$ and $\bar{x}_n$ (the latter are just plain real numbers).
All of this can be done without knowledge of what $f$ is, and already you have gained some information about your team. If we needed $f$ to compute sample moments, very little could be done when we are starved for information. It is a great benefit to descriptive statistics that sample estimates can be computed independently of the underlying distribution of the given observations.
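To illustrate, here is a minimal simulation (the two P&L distributions below are placeholders I chose, standing in for the unknown $f$): the sample mean is computed identically in both cases and, by the SLLN, lands near each trader's true mean:

```python
import numpy as np

rng = np.random.default_rng(2)  # illustrative seed

# Hypothetical daily P&L for two traders; the generating distributions
# stand in for the unknown f and are arbitrary choices of mine.
trader_a = rng.normal(loc=0.5, scale=3.0, size=5_000)  # true mean +0.5
trader_b = rng.standard_t(df=5, size=5_000) - 0.4      # heavy-tailed, true mean -0.4

# The sample mean is computed the same way regardless of the distribution.
for name, pnl in [("A", trader_a), ("B", trader_b)]:
    print(name, pnl.mean())
# The printed values sit near +0.5 and -0.4, respectively.
```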
Is it possible to prove that regardless of the underlying probability distribution, the corresponding sample moment will always be the same?
We define the sample $k$-th moment as
$$\bar{X^k}_n := \frac{1}{n}\sum_{i=1}^n X_i^k.$$
This expression, as a formula, does not depend on the density of $X_i$. There is no need to prove this, because it is simply a definition. However, the distribution of $\bar{X^k}_n$ itself does depend on the distribution of $X_i$. Indeed, in the simplest case, $k=1$ and $n=2$, the distribution of $\bar{X}_2$ is related to the convolution of the densities of $X_1$ and $X_2$.
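To see this in simulation, here is a minimal sketch (the uniform and exponential choices, sample size, and seed are mine): the same sample-mean formula yields differently distributed RVs depending on what the $X_i$ are:

```python
import numpy as np

rng = np.random.default_rng(3)  # illustrative seed
reps = 100_000

# The same formula, (X_1 + X_2)/2, applied to two different underlying distributions.
xbar_unif = rng.uniform(0.0, 1.0, size=(reps, 2)).mean(axis=1)
xbar_expo = rng.exponential(1.0, size=(reps, 2)).mean(axis=1)

# The resulting RVs differ: the uniform case is supported on [0, 1] with a
# triangular density, while the exponential case is supported on [0, inf).
print(xbar_unif.var(), xbar_expo.var())  # roughly 1/24 vs. 1/2
```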
So, 1) the formula is always the same, by definition, but 2) the distribution of a sample moment itself, as an RV in its own right, will depend on the underlying distribution of the data we are computing the moment of (and this matters for constructing confidence intervals, error estimates, lower bounds, etc.). Finally, it is important to realize that distinct samples with different underlying distributions may have the same sample moments, so these quantities do not entirely determine the distribution of an RV. There are many visual examples in machine learning books nowadays, so here is a more toy probability-theoretic example: let $X\sim \mathcal{U}(0, 1)$, which has mean $1/2$, and let $Y\sim \mathcal{N}(1/2, 1)$, which also has mean $1/2$. But $X$ is almost surely positive while $Y$ can be negative, so they cannot represent the same data.
Note that the higher moments differ. If all of the theoretical moments match, then under suitable conditions there are theorems (e.g. results on the moment problem, and the correspondence between MGFs and distributions) showing that $X=Y$ in distribution, i.e. $F_X(a)=F_Y(a)$ for all $a$.
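As a quick numerical check of the example above (the sample size and seed are my own choices), the first sample moments agree while the second moments and the supports do not:

```python
import numpy as np

rng = np.random.default_rng(4)  # illustrative seed
n = 200_000

x = rng.uniform(0.0, 1.0, size=n)           # X ~ U(0, 1), mean 1/2
y = rng.normal(loc=0.5, scale=1.0, size=n)  # Y ~ N(1/2, 1), mean 1/2

print(x.mean(), y.mean())              # first moments: both near 0.5
print((x**2).mean(), (y**2).mean())    # second moments: about 1/3 vs. 1.25
print((x < 0).mean(), (y < 0).mean())  # fraction negative: 0 vs. about 0.31
```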