31

I have an input which is a list, and the output is the maximum of the elements of the input list.

Can machine learning learn a function that always selects the maximum of the elements present in the input?

This might seem like a pretty basic question, but it might give me an understanding of what machine learning can do in general. Thanks!

Peter
user78739
  • 1
    I think you can try this as a sequence problem, i.e. using a recurrent neural network. Feed sorted data to the network. – vipin bansal Jul 31 '19 at 12:38
  • 2
    See also https://datascience.stackexchange.com/q/22242, https://datascience.stackexchange.com/q/29345; neural networks can sort an input list, so certainly can extract a maximum. – Ben Reiniger Jul 31 '19 at 16:20
  • 3
    @TravisBlack: actually, this is definitely the type of function that you cannot learn with standard neural networks. As an example, suppose you simply plug in a vector containing a value greater than any value you had in your training set. Do you think the trained neural network will give you back that largest value? – Cliff AB Aug 01 '19 at 03:47
  • 11
    @TravisBlack NOOO! Neural networks can not learn “basically any” mathematical function. Cardinality-wise, almost all functions are pathological almost-everywhere discontinuous ones. What you probably mean is, lots of the functions that mathematicians are actually interested in happen to be well-behaved enough that neural networks can approximate them arbitrarily well. But that's not at all the same thing as being able to learn any function. – leftaroundabout Aug 01 '19 at 08:15
  • 6
    @leftaroundabout and Cliff: It's good to see that someone stays on the ground in the recent ML/DL hype. People are using NNs, and when you dig one level deeper, you notice that they often don't have the slightest idea what they are actually doing there - beyond blindly tweaking parameters from some keras "Hello World" example until they see some pattern. xkcd got this exactly right: https://xkcd.com/1838/ . I hope that someone can still add an answer here that is more profound than the current ones appear to be. (No offense to anyone, but the common lack of understanding of NNs bugs me...) – Marco13 Aug 01 '19 at 13:28
  • @BenReiniger I think you should make that an answer. Rather than some handcrafted examples which indicate it's possible for some special cases, the questions you linked refer to peer-reviewed articles with well-specified theorems which prove the answer is definitively yes (and bound the size of the network you need for a particular length of sequence/number as well!) – Steven Jackson Aug 01 '19 at 15:27
  • In any case the result will be massively inefficient compared to the obvious simple O(log N)-depth tree of comparisons. Generally you would not want to abuse ML to do this. (By the way, when you say "the input is a list" - of what? Small integers? Large integers? Floats?) – smci Aug 03 '19 at 11:30
  • 1
    "Machine Learning" is an umbrella term that encompasses more than neural networks. If you include genetic programming as machine learning, it might work (though would be massive overkill). It would do so trivially if max was a built-in part of the language you are evolving a program in, but would probably if you simply had comparison operators (and enough other stuff to make it Turing complete). – John Coleman Aug 03 '19 at 13:54

7 Answers

39

Maybe, but note that this is one of those cases where machine learning is not the answer. There is a tendency to try to shoehorn machine learning into cases where, really, bog-standard rule-based solutions are faster, simpler, and just generally the right choice :P

Just because you can, doesn't mean you should

Edit: I originally wrote this as "Yes, but note that..." but then started to doubt myself, having never seen it done. I tried it out this afternoon and it's certainly doable:

import numpy as np
from keras.models import Model
from keras.layers import Input, Dense, Dropout
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from keras.callbacks import EarlyStopping

# Create an input array of 50,000 samples of 20 random numbers each
x = np.random.randint(0, 100, size=(50000, 20))

# And a one-hot encoded target denoting the index of the maximum of the inputs
y = to_categorical(np.argmax(x, axis=1), num_classes=20)

# Split into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y)

# Build a network, probably needlessly complicated since it needs a lot of dropout to
# perform even reasonably well.

i = Input(shape=(20, ))
a = Dense(1024, activation='relu')(i)
b = Dense(512, activation='relu')(a)
ba = Dropout(0.3)(b)
c = Dense(256, activation='relu')(ba)
d = Dense(128, activation='relu')(c)
o = Dense(20, activation='softmax')(d)

model = Model(inputs=i, outputs=o)

es = EarlyStopping(monitor='val_loss', patience=3)

model.compile(optimizer='adam', loss='categorical_crossentropy')

model.fit(x_train, y_train, epochs=15, batch_size=8, validation_data=(x_test, y_test), callbacks=[es])

print(np.where(np.argmax(model.predict(x_test), axis=1) == np.argmax(y_test, axis=1), 1, 0).mean())

Output is 0.74576, so it's correctly finding the max 74.6% of the time. I have no doubt that that could be improved, but as I say this is not a use case I would recommend for ML.

EDIT 2: Actually, I re-ran this this morning using sklearn's RandomForestClassifier, and it performed significantly better:

# instantiation of the arrays is identical
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=1000, verbose=1)
rfc.fit(x_train, y_train)

yhat_proba = rfc.predict_proba(x_test)


# We have some annoying transformations to do because this .predict_proba() call returns
# a list of 20 arrays (one per output column), each of shape (12500, 2) - effectively (20, 12500, 2).

for i in range(len(yhat_proba)):
    yhat_proba[i] = yhat_proba[i][:, 1]

pyhat = np.reshape(np.ravel(yhat_proba), (12500,20), order='F')

print(np.where(np.argmax(pyhat, axis=1) == np.argmax(y_test, axis=1), 1, 0).mean())

And the score here is 94.4% of samples with the max correctly identified, which is pretty good indeed.

Dan Scally
  • +1 Love love love this. I even wonder how well machine learning would do in a setup where the input is vectors $\vec{x} \in \mathbb{R}^n$ and the output is $\max{\vec{x}}$. – Dave Jul 31 '19 at 11:17
  • Exactly, Dave, that is exactly what I was also arriving at... – user78739 Jul 31 '19 at 11:42
  • Yeah I jumped to "Yes" because it's such an easy problem, but now I'm not so sure myself :P. The substantive part of the answer is "don't bother", so I'll amend it to that I'm afraid (or possibly spend the rest of today working out whether I can do it or not...) – Dan Scally Jul 31 '19 at 11:53
  • This doesn't seem very helpful, since it doesn't seem to even try to answer the question. – Travis Black Jul 31 '19 at 13:13
  • 1
    @TravisBlack yeah I originally started it as "Yes, but..." but then doubted myself and equivocated. I've improved the answer now :). – Dan Scally Jul 31 '19 at 14:50
  • CC @user78739, see improved response. – Dan Scally Jul 31 '19 at 14:50
  • I think this answer presupposes that the goal is efficiency. If one is simply trying to make a general AI that learns, learning how to do a max function is a great test since it is easily checked. – CramerTV Jul 31 '19 at 22:26
  • @CramerTV Yeah sure, I'm presupposing that the intention is to apply an algorithm to some real world problem. I could well be totally wrong. – Dan Scally Jul 31 '19 at 22:31
  • @DanScally, it all depends on which real world problem you're working on. If you're trying to create an AI for a narrow, specific issue, I'm with you. My real world problem is AGI (I'm not saying I'm equipped to solve that problem but it is my focus.) – CramerTV Jul 31 '19 at 22:55
  • @CramerTV ah. Yeah you may be right; I hadn't considered it in that context at all to be honest; AGI isn't something I've encountered yet! – Dan Scally Aug 01 '19 at 06:57
  • 19
    When training and testing the whole thing with vectors that contain values in [0,100], then the score is about 0.95. Fine. But when training it with values in [0,100], and testing it with values in [100,200], the score is practically zero. You already took a step back with your edit. But to make this unambiguously clear for those who blindly see ML as the miracle weapon that can solve all problems: whatever you are learning there, it is NOT 'the maximum function'! – Marco13 Aug 01 '19 at 13:19
  • Yeah; discouraging inappropriate use was the primary intent in my answer. You think the rest of it distracts from that? – Dan Scally Aug 01 '19 at 13:45
  • 2
    (An aside: in order to notify others about responses to their comments, use @, as in @Marco13.) Regarding the question: I think that your statement "machine learning is not the answer" makes it clear. I'm mainly afraid that too many people don't apply the appropriate scrutiny when using ML/DL/NNs, and particularly, when they encounter something that looks like it could "solve their problem", without understanding why it appears to do so, and thus without recognizing when a "solution" is only an artifact of a not-so-well-understood process. – Marco13 Aug 01 '19 at 13:54
  • @Marco13 It's well known that you should standardize inputs to neural networks. If you scaled both your [0,100] training set and [100,200] test set to fit in the range [0,1], then it should be no problem. – Brady Gilg Aug 01 '19 at 23:23
  • "And the score here is 94.4% of samples with the max correctly identified, which is pretty good indeed." - Then it's not learning the max() function, because max() will return the correct result 100% of the time. And won't blow up if given a number it hasn't seen in its training data. If the model can't generalize to things it hasn't seen before then it's not learning; just remembering. If ML can learn max() you should be able to train on data from +/-INT_MAX(without hitting every possible value) and get something that correctly answers inputs from +/-LONG_MAX. – aroth Aug 02 '19 at 03:06
  • 2
    @aroth sure; at best this is an approximation of max() applicable to the scope of the training data that it's seen. I was toying with the problem, but I don't intend to detract from the primary sentiment of my answer which is don't use ML for this kind of problem. – Dan Scally Aug 02 '19 at 07:03
  • @BradyGilg Actually, the fact that that would fix the issue shows that the model is not actually learning what you expect it to learn. What the model appears to have learnt is that the indices of numbers high in the range of the training data are likely to be the max. That doesn't work if you shift the possible range. If you normalize the input list, you are just shifting that problem again, towards 1 instead of 100 in this example. If you normalize each input to be in the range [0,1], you just teach the model that the argmax is where the list == 1. That's not finding the argmax. – JAD Aug 02 '19 at 11:26
  • 1
    @BradyGilg Standardizing the input data... uhhm... while you're probably right in that this would yield "better" results, the results still wouldn't make much sense, because the NN is not "learning the maximum function". And the argument is in some ways obviously a very academic one - I'd even say "too academic": You want to compute/predict the max's of some vectors, and in order to compute the max, you first have to compute the min/max to do a normalization (or mean/stdDev for a standardization, which doesn't seem to be very sensible either). – Marco13 Aug 02 '19 at 11:58
  • @JAD I understand that it's not the argmax function. But I still wonder how it finds where the list == 1. Also, here the list size is 20, but how would you approach this problem if the list size were longer, say 1000? I was doing a similar problem a few weeks ago, where I approached it via a CNN. Problem description: I had an array of 4000 elements with small noise in most of the elements and a 1 in one of the 4000 elements. I proceeded by normalizing the Y column by dividing by 4000. I used a CNN layer with sigmoid output, solved as a regression. I could get a correlation of approx 1, but I doubt it. – Siddharth Dhanpal Aug 14 '19 at 16:03
28

Yes. Very importantly, YOU decide the architecture of a machine learning solution. Architectures and training procedures don't write themselves; they must be designed or templated, and the training follows as a means of discovering a parameterization of the architecture that fits a set of data points.

You can construct a very simple architecture that actually includes a maximum function:

net(x) = a * max(x) + b * min(x)

where a and b are learned parameters.

Given enough training samples and a reasonable training routine, this very simple architecture will learn very quickly to set a to 1 and b to 0 for your task.
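
As an illustration (mine, not from the original answer - a minimal sketch assuming full-batch gradient descent on squared error, with NumPy standing in for a real framework):

import numpy as np

# Hypothetical training loop for net(x) = a * max(x) + b * min(x).
# Only a and b are learned; max and min are built into the architecture.
rng = np.random.default_rng(0)
X = rng.uniform(-100, 100, size=(1000, 20))  # 1000 random training lists
y = X.max(axis=1)                            # target: the true maximum

mx, mn = X.max(axis=1), X.min(axis=1)        # the architecture's two sub-functions
a, b = 0.5, 0.5                              # arbitrary initialization
lr = 1e-5                                    # small rate; the features are ~100 in magnitude
for _ in range(2000):
    err = a * mx + b * mn - y                # prediction error
    a -= lr * (err * mx).mean()              # d(MSE)/da
    b -= lr * (err * mn).mean()              # d(MSE)/db

print(round(a, 3), round(b, 3))              # approaches a = 1.0, b = 0.0

The same fit could of course be obtained in closed form by ordinary least squares on the two features max(x) and min(x).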

Machine learning often takes the form of entertaining multiple hypotheses about featurization and transformation of input data points, and learning to preserve only those hypotheses that are correlated with the target variable. The hypotheses are encoded explicitly in the architecture and sub-functions available in a parameterized algorithm, or as the assumptions encoded in a "parameterless" algorithm.

For example, the choice to use dot products and nonlinearities as is common in vanilla neural network ML is somewhat arbitrary; it expresses the encompassing hypothesis that a function can be constructed using a predetermined compositional network structure of linear transformations and threshold functions. Different parameterizations of that network embody different hypotheses about which linear transformations to use. Any toolbox of functions can be used and a machine learner's job is to discover through differentiation or trial and error or some other repeatable signal which functions or features in its array best minimize an error metric. In the example given above, the learned network simply reduces to the maximum function itself, whereas an undifferentiated network could alternatively "learn" a minimum function. These functions can be expressed or approximated via other means, as in the linear or neural net regression function in another answer. In sum, it really depends on which functions or LEGO pieces you have in your ML architecture toolbox.
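
To make the "LEGO pieces" point concrete, here is a sketch of mine (not part of the original answer): with ReLU units in the toolbox, a two-input maximum can be expressed exactly, since $\max(a,b) = \frac{a+b}{2} + \frac{|a-b|}{2}$ and $|z| = \operatorname{ReLU}(z) + \operatorname{ReLU}(-z)$; nesting this in a balanced tree yields an exact $n$-input max in $O(\log n)$ layers:

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def max2(a, b):
    # max(a, b) = (a + b)/2 + |a - b|/2, with |z| = relu(z) + relu(-z):
    # one hidden ReLU layer plus a linear readout, with fixed weights.
    return 0.5 * (a + b) + 0.5 * (relu(a - b) + relu(b - a))

def max_n(xs):
    # A balanced tree of max2 blocks: an exact n-input max in O(log n) layers.
    xs = list(xs)
    while len(xs) > 1:
        pairs = [max2(xs[i], xs[i + 1]) for i in range(0, len(xs) - 1, 2)]
        xs = pairs + (xs[-1:] if len(xs) % 2 else [])
    return xs[0]

print(max_n([0.3, 2.7, -1.2, 0.9]))  # 2.7, exactly, for any real inputs

Whether a given training procedure would discover these particular weights is a separate question; the point is that the hypothesis space you choose can contain the target function exactly, approximately, or not at all.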

pygosceles
  • 4
    +1 ML is nothing more than fancy regression equations and demands the right choice of equations. – aidan.plenert.macdonald Jul 31 '19 at 23:37
  • 4
    @aidan.plenert.macdonald the impact and appeal of ML, though, is that there is not one right choice of equations. Your chosen equations need to be a member of the set of suitable equations, but it turns out that for a broad range of problems that set contains equations that are much more generalised than a carefully-designed solution might be, but yield parameters that solve the problem much more quickly than putting in the additional design effort. This question is a good example of how this doesn't eliminate model design considerations altogether. – Will Aug 01 '19 at 09:37
  • 3
    That wasn't ever the question. The OP asked whether ML can find (/learn/infer) a function like max() (from labeled data). They didn't say "given that you already have max() as a building-block". – smci Aug 03 '19 at 11:36
  • @smci There is no "universal" prior for machine learning architectures or functions. As mentioned in my answer, you can approximate a maximum function using piecewise linear functions interspersed with nonlinearities--but there is no universal rule that says that all ML has to use that particular set of transformations in its toolbox. Neural networks often (but not always) have a maximum function at their disposal via Max Pooling or ReLU nonlinearities. The number of possible feature functions is limitless, which is why I highlight the role of choice and predisposed bias in ML architecture. – pygosceles Aug 05 '19 at 17:09
  • 1
    This is essentially the "correct" answer to the question asked. Aside from the practice of machine learning, there is actually a comprehensive and deep theoretical side to most of the things (especially the basics of machine learning) that you encounter in practice. You can study these topics in advance under machine learning theory. As mentioned by others, before you observe your dataset, you need to make assumptions about the hypothesis space. This is related to a theoretical result called the No Free Lunch Theorem; inductive bias is the idea behind it. – Moher Oct 26 '20 at 12:49
7

Yes - Machine learning can learn to find the maximum in a list of numbers.

Here is a simple example of learning to find the index of the maximum:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Create training pairs where the input is a list of numbers and the output is the argmax
training_data = np.random.rand(10_000, 5)  # Each list is 5 elements; 10K examples
training_targets = np.argmax(training_data, axis=1)

# Train a decision tree with scikit-learn
clf = DecisionTreeClassifier()
clf.fit(training_data, training_targets)

# Let's see if the trained model can correctly predict the argmax for new data
test_data = np.random.rand(1, 5)
prediction = clf.predict(test_data)
assert prediction == np.argmax(test_data)  # The test passes - the model has learned argmax
Brian Spiering
  • Is it really learning the "maximum" function? A training set of 10,000 five-element lists is a reasonable approximation to the complete input space. – Mark Jul 31 '19 at 22:15
  • 2
    Disclaimer: I'm not a ML/DL expert. But I'm pretty sure that this does not make any sense. I mean: no sense, at all. As I see it, you're not learning the maximum function. You're learning the indices of the maximum elements of the training set. If you input a vector that contains two numbers that are both larger than those of the training set, it would likely fail. Not to mention the case where you don't have a 5D but a 10D vector. Throwing some data into a library that one doesn't understand and seeing a certain result does NOT (at all) mean that it "works". – Marco13 Aug 01 '19 at 12:47
  • I mean, it depends on what "it works" is supposed to mean. A decision tree in particular is only ever going to produce a piecewise-constant function, the pieces being axis-aligned rectangular boxes. In the max example, training on a solid hypercube, the actual max function is piecewise-constant on some triangular sort of regions. Given enough training examples and depth, the tree will approximate these triangular regions to arbitrary accuracy. But, as with many (most?) other models, any test samples outside the training samples' range are pretty hopeless. – Ben Reiniger Aug 02 '19 at 00:41
  • 1
    This doesn't prove anything. The OP asked "the maximum in a list of numbers". You assumed they must be floats in the range 0..1. Try to input a 2 (or -1, or 1.5) and it'll fail. – smci Aug 03 '19 at 11:38
4

Learning algorithms

Instead of learning a function as a calculation done by a feed-forward neural network, there's a whole research domain concerned with learning algorithms from sample data. For example, one might use something like a Neural Turing Machine or some other method where the execution of an algorithm is controlled by machine learning at its decision points. Toy algorithms like finding a maximum, or sorting a list, or reversing a list, or filtering a list are commonly used as examples in algorithm-learning research.

Peteris
3

I will exclude educated designs from my answer. No, it is not possible to use an out-of-the-box machine learning (ML) approach to fully represent the maximum function for arbitrary lists with arbitrary precision. ML is a data-based method, and it is clear that you will not be able to approximate a function in regions where you do not have any data points. Hence, the space of possible observations (which is infinite) cannot be covered by finite observations.

My statements have a theoretical foundation in Cybenko's Universal Approximation Theorem for neural networks. I will quote the theorem from Wikipedia:

In the mathematical theory of artificial neural networks, the universal approximation theorem states[1] that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of $\mathbb{R}^n$, under mild assumptions on the activation function. The theorem thus states that simple neural networks can represent a wide variety of interesting functions when given appropriate parameters; however, it does not touch upon the algorithmic learnability of those parameters.

The most important part is the restriction to compact subsets of $\mathbb{R}^n$. This restriction rules out approximating the maximum function on all of $\mathbb{R}^n$, and it manifests itself in the poor out-of-range fit of the model from the answer with the most upvotes.

If your space of observations is compact, then you might be able to approximate the maximum function with a finite data set. As the top-voted answer made clear, though, you should not reinvent the wheel!
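
As a small illustration of the compactness caveat (my sketch, not part of the original answer - it reuses the decision-tree argmax setup from another answer; exact numbers will vary with the seed):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Train an argmax classifier on lists whose values all lie in [0, 1) ...
rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 1.0, size=(10_000, 5))
clf = DecisionTreeClassifier().fit(X_train, X_train.argmax(axis=1))

# ... then test inside and outside the training range.
X_in = rng.uniform(0.0, 1.0, size=(1_000, 5))
X_out = X_in + 1.0  # identical argmax, but every value lies outside [0, 1)

print((clf.predict(X_in) == X_in.argmax(axis=1)).mean())    # high accuracy
print((clf.predict(X_out) == X_out.argmax(axis=1)).mean())  # collapses toward chance (~0.2)

Every split threshold the tree learned lies inside [0, 1), so all shifted samples are routed down the same branches and the predictions degenerate, even though the target function itself is shift-invariant.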

MachineLearner
1

Here's an expansion on my comment. To preface, absolutely @DanScally is right that there's no reason to use ML for finding a maximum of a list. But I think your "it might give me an understanding of what machine learning can do in general" is good enough reason to delve into this.

You ask about more general machine learning, but I'll focus on neural networks. In that context, we must first ask whether the actual functions produced by a neural network can approximate (or evaluate exactly) $\max$, and only then can we further inquire whether any of the (common?) training methods can fit a NN approximating $\max$.


The comments and @MachineLearner's answer brought up universal approximation theorems: on a bounded domain, a neural network can approximate any reasonably nice function like $\max$, but we can't expect a priori to approximate $\max$ on arbitrary inputs, nor to exactly calculate $\max$ anywhere.

But, it turns out that a neural network can exactly sort arbitrary input numbers. Indeed, $n$ $n$-bit integers can be sorted by a network with just two hidden layers of quadratic size. Depth Efficient Neural Networks for Division and Related Problems, Theorem 7 on page 955; many thanks to @MaximilianJanisch in this answer for finding this reference.

I'll briefly describe a simplification of the approach in that paper to produce the $\operatorname{argmax}$ function for $n$ arbitrary distinct inputs. The first hidden layer consists of $\binom{n}{2}$ neurons, each representing the indicator variable $\delta_{ij} = \mathbf{1}(x_i < x_j)$, for $i<j$. These are easily built as $x_j-x_i$ with a step activation function. The next layer has $n$ neurons, one for each input $x_i$; start with the sum $\sum_{j<i} \delta_{ji} + \sum_{j>i} (1-\delta_{ij})$; that is, the number of $j$ such that $x_i>x_j$, and hence the position of $x_i$ in the sorted list. To complete the argmax, just threshold this layer.
At this point, if we could multiply, we'd get the actual maximum value pretty easily. The solution in the paper is to use the binary representation of the numbers, at which point binary multiplication is the same as thresholded addition. To just get the argmax, it suffices to have a simple linear function multiplying the $i$th indicator by $i$ and summing.
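
Here is a sketch (mine, not from the paper) of this construction with hard-threshold units in NumPy, assuming distinct inputs; every operation is a linear combination followed by a step activation, so it maps directly onto a two-hidden-layer threshold network:

import numpy as np

def step(z):
    # Heaviside step activation: 1 if z > 0, else 0
    return (z > 0).astype(float)

def argmax_net(x):
    # Two-hidden-layer threshold network computing argmax of distinct inputs.
    x = np.asarray(x, dtype=float)
    n = len(x)

    # Hidden layer 1: delta[(i, j)] = 1(x_i < x_j) = step(x_j - x_i), for i < j.
    delta = {(i, j): step(x[j] - x[i]) for i in range(n) for j in range(i + 1, n)}

    # Hidden layer 2: rank_i = #{j : x_i > x_j}, a linear combination of the deltas,
    # i.e. the position of x_i in the sorted list.
    rank = np.array([
        sum(delta[(j, i)] for j in range(i))                # j < i with x_j < x_i
        + sum(1 - delta[(i, j)] for j in range(i + 1, n))   # j > i with x_j < x_i
        for i in range(n)
    ])

    # Threshold: only the maximum has rank n - 1.
    indicator = step(rank - (n - 1) + 0.5)

    # Linear readout: sum of i * indicator_i yields the argmax index.
    return int(np.dot(np.arange(n), indicator))

x = [0.3, 2.7, -1.2, 0.9]
print(argmax_net(x), np.argmax(x))  # 1 1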


Finally, for the subsequent question: can we train a NN into this state? @DanScally got us started; maybe knowing the theoretical architecture can help us cheat our way into the solution? (Note that if we can learn/approximate the particular set of weights above, the net will actually perform well outside the range of the training samples.)

Notebook in github / Colab

Changing things just a little bit, I get a better testing score (0.838), and even testing on a sample outside the original training range gets a decent score (0.698). Using inputs scaled to $[-1,1]$ gets the test score up to 0.961, with an out-of-range score of 0.758. But I'm scoring with the same method as @DanScally, which seems a little dishonest: the identity function will score perfectly on this metric. I also printed out a few coefficients to see whether anything close to the exact fit described above appears (not really), and a few raw outputs, which suggest the model is too timid in predicting a maximum, erring on the side of predicting that none of the inputs is the maximum. Maybe modifying the objective could help, but at this point I've put in too much time already; if anyone cares to improve the approach, feel free to play (in Colab if you like) and let me know.

Ben Reiniger
  • I haven't yet wrapped my head around the paper (which is math-heavy... and surprisingly old...), but even though it might just be the ambiguous term "network" that brought this association to my mind, I wondered whether one could design a neural network that essentially "emulates" a sorting network ... – Marco13 Aug 12 '19 at 18:53
  • @Marco13, sure, I think using that paper to produce NNs as comparators would produce a NN emulation of the sorting network. It would be quite a lot deeper than the paper's, but the width might get shrunk down to linear size? – Ben Reiniger Aug 13 '19 at 15:44
  • Admittedly, I'm not nearly as deeply involved in NNs as I'd need to be to say something profound. But things like ~"you can emulate everything with two layers" sound a bit like the results from low-level circuit design, where you say that you can "implement every function with two layers of NAND gates" or whatnot. I think that some of the NNs that are examined recently are just fancy versions of things that people already discovered 50 years ago, but maybe this is a misconception... – Marco13 Aug 13 '19 at 22:16
0

Yes, even machine learning as simple as ordinary linear least squares can do this, if you use some applied cleverness.

(But most would consider this quite horrible overkill).

(I will assume we want to find the index of the max of the absolute values of the input vector ${\bf r}$):

  1. Select a monotonically decreasing function of the absolute value, for example $$f(x) = \frac{1}{x^2}$$
  2. Build the diagonal matrix of $f({\bf r})$. Let us call it ${\bf C_r}$.
  3. Build the row vector ${\bf S}$ full of ones.
  4. Build and solve the equation system $(\epsilon {\bf I}+10^3{\bf S}^t{\bf S}+{\bf C_r})\,{\bf p} = 10^3 {\bf S}^t$ for a small $\epsilon > 0$.
  5. The result vector $\bf p$ will be a probability measure (it sums to 1); we can reweigh it nonlinearly to sharpen it, for example $$p_i \leftarrow \frac{p_i^k}{\sum_j |p_j|^k}$$
  6. Calculate the scalar product of $\bf p$ with the index vector $(0, 1, \dots, n-1)$ and round.
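
A literal NumPy transcription of this recipe (my sketch; $\epsilon$, $k$, and $f$ are just the choices suggested above, and $f(x) = 1/x^2$ assumes no zero entries):

import numpy as np

def lstsq_argmax_abs(r, eps=1e-9, k=8):
    # Steps 1-2: C_r = diag(f(r)) with f(x) = 1/x^2 (assumes nonzero entries)
    r = np.asarray(r, dtype=float)
    n = len(r)
    C = np.diag(1.0 / r**2)
    # Step 3: row vector of ones
    S = np.ones((1, n))
    # Step 4: solve (eps*I + 10^3 * S^t S + C_r) p = 10^3 * S^t
    A = eps * np.eye(n) + 1e3 * (S.T @ S) + C
    p = np.linalg.solve(A, 1e3 * S.T).ravel()
    # Step 5: p sums to ~1; sharpen it nonlinearly
    p = p**k / np.sum(np.abs(p)**k)
    # Step 6: scalar product with the index vector, then round
    return int(round(float(np.dot(p, np.arange(n)))))

r = [3.0, -7.0, 2.0, 5.0]
print(lstsq_argmax_abs(r), int(np.argmax(np.abs(r))))  # 1 1

Solving the system makes $p_i \propto 1/(\epsilon + f(r_i))$, i.e. largest where $|r_i|$ is largest, and the reweighting in step 5 pushes $\bf p$ toward a one-hot vector at that index.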
mathreadler