
I am new to machine learning and would appreciate some help with the following question. I have observed that the literature focuses on algorithms and on how one learning method performs better than others on a given data set, and remarkable progress has been made on that front. However, I have not been able to find references that discuss the underlying structure of the data space as it relates to the limits of what "can" and "can NOT" be learned from a specific data set/type.

My specific question is as follows: are there formal conditions on a mapping (I will settle for examples) $f: X \to Y$, with $y \in Y \subset \mathbb{R}^n$ and $x \in X \subset \mathbb{R}^m$, indicating that $f$ can NOT be "learned" from a finite set of training data $\hat{X} = \{x_1, x_2, \ldots, x_T\} \subset X$ and $\hat{Y} = \{y_1, y_2, \ldots, y_T\} \subset Y$, regardless of the size $T$ or the choice of the training data?

Nikos M.
Brian S.
  • Welcome to DataScienceSE. The problem with this question is that there is no generally accepted concept of "learnable" in statistical ML. To my knowledge there are at least two formal models of learnability: the PAC model and Gold's model of language identification in the limit, but the latter deals only with symbolic learning of languages. I suspect, though, that none of these models is used anymore. – Erwan Jul 05 '22 at 21:28
  • You might also be interested in this question. – Erwan Jul 05 '22 at 21:31
  • Thank you @Erwan! This is a good start. I will work on making the "concept" of learning more formal using the loss function. I feel, however, that the question may be even more basic. Some $y = f(x)$ relationships simply cannot be learned. Here is an example: randomly generate 2D points in the $x, y$ plane and randomly assign $x$'s to $y$'s. I can "over-learn" from this type of training data and "memorize" the associations with NNs, but we haven't "learned" anything, because there was nothing to learn. I am looking for the formalism to identify $\{x, y\}$'s that don't contain learnable mappings. – Brian S. Jul 05 '22 at 22:09
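A minimal sketch of the over-fitting experiment described in the comment above (the network size, data sizes, and use of scikit-learn are illustrative choices, not part of the original discussion): an over-sized network can drive the training error down on randomly paired data, yet it predicts nothing about fresh pairs, because there was no mapping to learn.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Random 2D inputs, each randomly assigned a target:
# by construction there is no mapping to learn.
x = rng.uniform(-1, 1, size=(200, 2))
y = rng.permutation(rng.uniform(-1, 1, size=200))

# A deliberately over-sized network can still memorize the pairs.
net = MLPRegressor(hidden_layer_sizes=(512, 512), max_iter=20000,
                   tol=1e-7, random_state=0).fit(x, y)
train_mse = np.mean((net.predict(x) - y) ** 2)

# On fresh, equally arbitrary pairs the "learned" model is useless.
x_new = rng.uniform(-1, 1, size=(200, 2))
y_new = rng.uniform(-1, 1, size=200)
test_mse = np.mean((net.predict(x_new) - y_new) ** 2)

print(f"train MSE: {train_mse:.4f}   fresh-pairs MSE: {test_mse:.4f}")
```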

2 Answers


You can apply statistical techniques that check whether there is a functional (possibly non-linear) dependence between the $x$ and $y$ variables.

These techniques include (apart from the simplest Pearson and Spearman correlation coefficients):

  1. The maximal information coefficient (link) and its 1948 precursor, the dependence measure defined by Wassily Hoeffding (link). See also here and here.

  2. Alternating Conditional Expectations by Breiman and Friedman (1985 link, Fortran, R).

  3. Distance correlation by Gábor J. Székely (2005, link, R).

  4. Probably many other techniques, as none of them is perfect.

There may be a situation where $y$ is totally uncorrelated with each $x_i$ individually but has a functional dependence on subsets $x_{i_1}, \ldots, x_{i_k}$. Example: the XOR problem (link).
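To make this concrete, here is a minimal sketch (a from-scratch implementation of the empirical distance correlation in Python/NumPy, used in place of the referenced R package): pairwise Pearson correlations miss the XOR dependence, while the distance correlation of $y$ with the pair $(x_1, x_2)$ exposes it.

```python
import numpy as np

def distance_correlation(x, y):
    """Empirical distance correlation (Székely et al.) for samples of
    shape (n, d_x) and (n, d_y); detects general, possibly non-linear
    dependence, including joint dependence on several variables."""
    x = x.reshape(len(x), -1).astype(float)
    y = y.reshape(len(y), -1).astype(float)
    def doubly_centered_distances(z):
        d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
        return d - d.mean(axis=0) - d.mean(axis=1, keepdims=True) + d.mean()
    a = doubly_centered_distances(x)
    b = doubly_centered_distances(y)
    dcov2 = (a * b).mean()  # squared distance covariance (V-statistic)
    return np.sqrt(dcov2 / np.sqrt((a * a).mean() * (b * b).mean()))

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(1000, 2)).astype(float)  # two random bits
y = (x[:, 0] != x[:, 1]).astype(float)                # y = x_1 XOR x_2

# Individually, each x_i looks unrelated to y ...
for i in range(2):
    print(f"Pearson corr(x_{i + 1}, y) = {np.corrcoef(x[:, i], y)[0, 1]:+.3f}")

# ... but the pair (x_1, x_2) determines y exactly.
print(f"dCor((x_1, x_2), y) = {distance_correlation(x, y):.3f}")
```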

Vladislav Gladkikh
  • Thank you for the input, @Vladislav. Applying statistical methods (especially the second-order statistical measures among these) between two high-dimensional spaces ($X$ and $Y$) does not always yield successful answers, and the challenging mappings are highly non-linear. – Brian S. Jul 06 '22 at 15:37

According to the universal approximation theorems, mappings that satisfy the theorems' conditions (for example, continuity on a compact domain) are in principle learnable.

One should note that these theorems are not constructive: they provide no algorithm for constructing the approximator or for verifying the conditions. They only state that if the conditions hold, the task is theoretically learnable by some architecture.
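As a toy illustration (a sketch only; the target function, architecture, and library are arbitrary choices of mine, not prescribed by the theorem), one can pick some architecture, fit a continuous function on a compact interval, and check the approximation error empirically:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Target: a continuous function on the compact interval [-pi, pi].
x = rng.uniform(-np.pi, np.pi, size=(2000, 1))
y = np.sin(3 * x).ravel()

# One hidden layer with a bounded activation, as in the classic theorem
# statements; width 64 is an arbitrary choice, not given by the theorem.
net = MLPRegressor(hidden_layer_sizes=(64,), activation="tanh",
                   max_iter=5000, random_state=0).fit(x, y)

x_grid = np.linspace(-np.pi, np.pi, 500).reshape(-1, 1)
err = np.max(np.abs(net.predict(x_grid) - np.sin(3 * x_grid).ravel()))
print(f"max |error| on the interval: {err:.3f}")
```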

Linear mappings are also constructively and demonstrably learnable by linear models.

Now, whether a given task involves a mapping that satisfies these criteria is a separate matter; falling outside the scope of these theoretical results does not necessarily imply that the mapping is unlearnable, although theoretically this is an open question.

Assume that an ML model of a certain mapping provides a shorter algorithmic description of that mapping than the explicit mapping itself. Then one way to formalize unlearnability is to claim that there can be no shorter algorithmic description of the mapping than the (possibly infinite) explicit mapping itself; in other words, the mapping is algorithmically random. In this respect one can employ concepts such as Kolmogorov complexity and algorithmic randomness.
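Kolmogorov complexity itself is uncomputable, but an off-the-shelf compressor gives a crude, computable upper bound on it. A toy sketch along these lines (the specific mappings are my own illustrative choices): a lookup table generated by a short program compresses well, while a randomly generated one does not, mirroring the learnable vs. algorithmically random distinction above.

```python
import zlib
import numpy as np

# zlib as a crude, computable upper bound on Kolmogorov complexity:
# a short program behind the data => a highly compressible lookup table.
rng = np.random.default_rng(0)
x = np.arange(10000)

structured_y = ((x * 37 + 11) % 256).astype(np.uint8)          # short generating program
random_y = rng.integers(0, 256, size=x.size).astype(np.uint8)  # no structure expected

for name, y in [("structured", structured_y), ("random", random_y)]:
    table = y.tobytes()  # the explicit mapping, serialized as a lookup table
    ratio = len(zlib.compress(table, 9)) / len(table)
    print(f"{name:10s} mapping compresses to {ratio:.1%} of its size")
```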

Nikos M.
  • Thank you for your input @Nikos. Interesting perspective, and I do appreciate it; to be clear, I did not (am not yet able to) cast a vote on the answers as of the time of this comment. – Brian S. Jul 06 '22 at 15:25