
I know that Polynomial Logistic Regression can easily learn typical data like that in the following image:

[first image]

I was wondering whether the following two datasets can also be learned using Polynomial Logistic Regression.


[second image]
[third image]

I guess I have to add more explanation. Consider the first shape. If we add extra polynomial features for this 2-D input (like $x_1^2$, ...), we can make a decision boundary that separates the data. Suppose I choose $x_1^2 + x_2^2 = b$; this can separate the data. If I add extra features I will get a wavy shape (maybe a wavy circle or a wavy ellipse), but it still cannot separate the data of the second graph, can it?
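To make the first case concrete, here is a minimal sketch (assuming scikit-learn; the dataset parameters are invented for illustration) showing that degree-2 polynomial features suffice for circular data:

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Concentric-circle data, similar in spirit to the first image
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Logistic regression on degree-2 polynomial features (x1, x2, x1^2, x1*x2, x2^2)
model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
print(model.fit(X, y).score(X, y))  # expect training accuracy close to 1.0
```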

  • Maybe it's a slip, but you're implicitly asking about classification, not regression... – Emre Aug 02 '17 at 15:08
  • @Emre actually I'm asking about polynomial logistic regression; you are right :) – Green Falcon Aug 02 '17 at 15:16
  • Where did you read about that? Did you mean multinomial logistic regression? – Emre Aug 03 '17 at 03:35
  • In statistics it's a common discussion, which has been extended to machine learning. I was wondering whether non-convex data can be learned or not. In neural networks, learning the second image requires a two-hidden-layer network. – Green Falcon Aug 03 '17 at 04:52
  • I've never heard of polynomial logistic regression in statistics. I think you should look into kernel logistic regression if you are interested in nonlinear class boundaries. – Emre Aug 03 '17 at 04:58
  • There is no reference to polynomial logistic regression there. Or any other kind of logistic regression. – Emre Aug 03 '17 at 06:44
  • This is an extension of polynomial regression used for classification problems; you can find it in the exercise files of that unit. Anyway, it's a common approach, sir. You may have seen its common one-vs-all approach. – Green Falcon Aug 03 '17 at 07:08
  • @Emre: I think it is just a term for the usual logistic regression with feature engineering to add polynomial terms in $x_i$ such as $x_1^2$ or $x_2x_3$. – Neil Slater Aug 08 '17 at 20:04
  • @Emre actually yes, you are right. It is somewhat feature engineering; but the point is that this kind of feature can separate convex shapes (like the first one provided). I want to know whether we can learn non-convex shapes like the second graph. – Green Falcon Aug 08 '17 at 20:12
  • @Emre also, to see what I mean about the neural networks, see here – Green Falcon Aug 08 '17 at 20:12
  • @NeilSlater Yes, I figured as much, but no one calls that polynomial logistic regression. The answer is yes, if you can reformulate the boundary as an equation. For example, the first one is similar to $(|x|-a)^2 + y^2 = b$. Thus, adding terms for $|x|, x^2, y^2$ will do the trick. Get the idea? – Emre Aug 08 '17 at 20:28

1 Answer


Yes, in theory the polynomial extension to logistic regression can approximate any arbitrary classification boundary. That is because a polynomial can approximate any continuous function on a closed, bounded region (which covers the types useful to classification problems), as proven by the Stone-Weierstrass theorem.

Whether this approximation is practical for all boundary shapes is another matter. You may be better off looking at other basis functions (e.g. Fourier series, or radial distance from example points), or other approaches entirely (e.g. SVM), when you suspect a complex boundary shape in feature space. The problem with using high-order polynomials is that the number of polynomial features you need grows combinatorially with the degree of the polynomial and the number of original features.
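As a rough illustration of that growth, you can count the expanded features directly (a sketch assuming scikit-learn's `PolynomialFeatures`):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# The number of terms in a full polynomial expansion is C(n + d, d),
# which blows up quickly as degree d and input count n grow together
for degree in (2, 3, 5):
    for n_features in (2, 10, 20):
        pf = PolynomialFeatures(degree=degree).fit(np.zeros((1, n_features)))
        print(f"degree={degree}, inputs={n_features}: {pf.n_output_features_} features")
```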

You could make a polynomial to classify XOR. $5 - 10xy$ might be a start if you use $-1$ and $1$ as the binary inputs; this maps input $(x,y)$ to output as follows:

$$(-1,-1): -5 \qquad (-1,1): 15 \qquad (1,-1): 15 \qquad (1,1): -5$$

Passing that into the logistic function should give you values close enough to 0 and 1.
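A quick numeric check of that mapping (plain NumPy; the `sigmoid` helper is just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# z = 5 - 10*x*y on inputs in {-1, +1} implements XOR
for x, y in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    z = 5 - 10 * x * y
    print(f"({x:+d}, {y:+d}): z = {z:+d}, sigmoid(z) = {sigmoid(z):.4f}")
```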

Similar to your two circular areas is a simple figure-of-eight curve:

$$z = a(x^2 - y^2 - bx^4 + c)$$

where $a$, $b$ and $c$ are constants. You can get two disjoint closed areas defined in your classifier, on opposite sides of the $y$ axis, by choosing $a$, $b$ and $c$ appropriately. For example, try $a=1, b=0.05, c=-1$ to get a function that clearly separates into two peaks around $x=-3$ and $x=3$:

[plot: two separate classes]

The plot shown is from an online tool at academo.org, and shows where $x^2 - y^2 - 0.05x^4 - 1 > 0$: the positive class, displayed as value 1 in the plot above. In logistic regression this is typically where $\frac{1}{1+e^{-z}} > 0.5$, or simply where $z > 0$.
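You can check the two regions numerically (a short sketch; the sample points are arbitrary):

```python
# z = x^2 - y^2 - 0.05*x^4 - 1: positive inside the two lobes, negative elsewhere
def z(x, y):
    return x**2 - y**2 - 0.05 * x**4 - 1

print(z(3, 0), z(-3, 0))  # 3.95 3.95 -> inside the lobes (class 1)
print(z(0, 0), z(0, 2))   # -1.0 -5.0 -> between the lobes (class 0)
print(z(6, 0))            # -29.8     -> far along the x axis is outside too
```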

An optimiser will find the best coefficient values; you would just need to supply $1, x^2, y^2, x^4$ as your expansion terms, as in the sketch below. (Note that these specific terms are limited to matching the same basic shape reflected about the $y$ axis; in practice you would want multiple terms, up to the full fourth-degree polynomial, to find more arbitrary disjoint groups in a classifier.)
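Here is a sketch of that fit (assuming scikit-learn; the synthetic blob data, seed, and exclusion radius are invented for the demo):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Positive class: two blobs near (-3, 0) and (3, 0); negative class: the rest
pos = np.vstack([rng.normal((-3, 0), 0.5, (100, 2)),
                 rng.normal((3, 0), 0.5, (100, 2))])
neg = rng.uniform(-6, 6, (400, 2))
dist = np.minimum(np.hypot(neg[:, 0] + 3, neg[:, 1]),
                  np.hypot(neg[:, 0] - 3, neg[:, 1]))
neg = neg[dist > 1.5][:200]  # keep negatives clear of the blobs

X = np.vstack([pos, neg])
y = np.r_[np.ones(len(pos)), np.zeros(len(neg))]

# Hand-picked expansion terms x^2, y^2, x^4; the intercept plays the role of 1
Phi = np.column_stack([X[:, 0]**2, X[:, 1]**2, X[:, 0]**4])
clf = LogisticRegression(max_iter=1000).fit(Phi, y)
print(clf.score(Phi, y))  # expect training accuracy close to 1.0
```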

In fact, any problem you can solve with a deep neural network, of any depth, you can also solve with a flat structure using linear regression (for regression problems) or logistic regression (for classification problems). It is "just" a matter of finding the right feature expansion. The difference is that neural networks attempt to discover a working feature expansion directly, whilst feature engineering using polynomials or any other scheme is hard work, and it is not always obvious how to even start. Consider, for example, how you might create polynomial approximations to what convolutional neural networks do for images: it seems impossible, and is likely to be extremely impractical too. But such an expansion does exist.

  • Are you sure about the first paragraph? Can non-convex shapes really be learned using polynomials? – Green Falcon Aug 08 '17 at 20:26
  • Please first think of this: neural networks (like an MLP with one hidden layer) cannot learn non-convex shapes, so we use nets with at least two hidden layers to handle them. Now, are you sure that we can overcome such situations here? I guess polynomials are good just for the XOR problem regime. – Green Falcon Aug 08 '17 at 20:31
  • Suppose, as in the graphs, we have a 2-D input and a label (0 or 1). If we have a decision boundary such as $x_1^2 + x_2^2 = b$, this is a simple circle, and by adding extra features we can get a wavy shape, but it is convex. If I am wrong, would you please give a polynomial combination of the features $x_1$ and $x_2$ that solves (separates) the second graph? – Green Falcon Aug 08 '17 at 20:35
  • @NeilSlater I solved the circle one above. XOR is easy too: just add a term for $xy$. – Emre Aug 08 '17 at 20:47
  • @Emre: Thanks. I think to be purist though, $|x|$ is not allowed, since it is a different non-linearity. – Neil Slater Aug 08 '17 at 21:15
  • Why stop at polynomials? It's not as if we take the derivatives of the inputs, so use any function you want. – Emre Aug 08 '17 at 21:21
  • @Emre Yes, I already suggest in the answer that other functions could be useful. However, I think the point of the OP using the term "Polynomial Logistic Regression" is that it is intended to just use polynomials. I think there are some libraries that will do automatic polynomial expansion of features for use with otherwise linear optimisers. – Neil Slater Aug 08 '17 at 21:37
  • @NeilSlater please look here; again it separates the space into two parts. Actually, as I have mentioned, I know that an MLP can learn (nearly) anything, but does the same hold for polynomials? Also, I have to say that at Stanford people call this polynomial regression; the idea is that we add polynomial terms to logistic regression and at the optimization stage treat it as simple linear logistic regression, because our aim is just to reduce the error. Thanks for the effort. – Green Falcon Aug 08 '17 at 21:52
  • This shape is used for regression, I guess, isn't it? Actually, as far as I have studied, for classification we are restricted to lines in the $xy$ plane. The shape used here is used in RBF models (I have actually seen it there). Sorry, I did not get the point; the problem is classification. – Green Falcon Aug 08 '17 at 22:05
  • @NeilSlater As a matter of fact, $z$ is zero or one (this is the restriction). – Green Falcon Aug 08 '17 at 22:07