1

I'm studying the perceptron algorithm. I know that we can use the weights w as the coefficients of the hyperplane that separates the vectors to be classified.

On every web page I've read for a detailed explanation, it's generally said that the perceptron algorithm will find the optimal weights w. But optimal in which sense?

Green Falcon
Poiera

2 Answers

1

Media's explanation (the other answer) is true for regression problems. These are problems where you predict a continuous target variable.

Your image shows a classification problem. Here, the target variable takes only two values (typically -1 and 1 in the Perceptron algorithm). In that case, an optimal solution $w^*$ is a vector of weights that perfectly separates both classes. If such a solution exists, the Perceptron algorithm will find it. But: If there is one optimal solution, there are usually infinitely many other optimal solutions. You can easily see this in your image: You can move the line a little to the left or to the right, and you can rotate it a little, and it still perfectly separates the classes.

So while the Perceptron algorithm will find an optimal solution if there is one, you cannot know which one it will find. That depends on the random starting parameters.
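
Here is a minimal NumPy sketch of that point (the toy data, the seeds, and the `perceptron` helper are made up for illustration): two runs started from different random weights both end with zero training errors, yet with different weight vectors, i.e. different separating lines.

```python
import numpy as np

def perceptron(X, y, w_init, epochs=100):
    """Classic perceptron rule: on each mistake, w <- w + y_i * x_i."""
    w = w_init.astype(float)
    for _ in range(epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:   # misclassified (or on the boundary)
                w += y_i * x_i
                mistakes += 1
        if mistakes == 0:                   # data perfectly separated -> stop
            break
    return w

rng = np.random.default_rng(0)
# Two linearly separable blobs; a constant column acts as the bias term.
X = np.vstack([rng.normal(loc=[-2, -2], scale=0.5, size=(20, 2)),
               rng.normal(loc=[+2, +2], scale=0.5, size=(20, 2))])
X = np.hstack([X, np.ones((40, 1))])
y = np.array([-1] * 20 + [+1] * 20)

for seed in (1, 2):
    w0 = np.random.default_rng(seed).normal(size=3)
    w = perceptron(X, y, w0)
    errors = int(np.sum(np.sign(X @ w) != y))
    print(f"seed {seed}: w = {np.round(w, 2)}, training errors = {errors}")
# Both runs reach 0 training errors, but the weight vectors (and hence the
# separating lines) generally differ.
```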

This is different e.g. for support vector machines. Here, there is either no optimal solution or exactly one optimal solution.
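
For contrast, a short sketch of the SVM claim (scikit-learn is assumed to be available; the toy data are made up): because the max-margin separator is unique for separable data, refitting after shuffling the training set gives essentially the same hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-2, -2], scale=0.5, size=(20, 2)),
               rng.normal(loc=[+2, +2], scale=0.5, size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

perm = rng.permutation(len(y))                         # shuffle the sample order
svm_a = SVC(kernel="linear", C=1e6).fit(X, y)          # large C ~ hard margin
svm_b = SVC(kernel="linear", C=1e6).fit(X[perm], y[perm])
print(np.round(svm_a.coef_, 3))
print(np.round(svm_b.coef_, 3))                        # numerically the same
```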

Elias Strehle
  • Take a look here. In both regression and classification tasks, we attempt to reduce the error. The error is a function of the weights, and we try to reduce it in that space. – Green Falcon Feb 24 '18 at 18:51
  • Yes, you are right. But the perceptron algorithm for classification minimizes the classification error $\frac1n \sum_{i=1}^n |\hat{y}_i-y_i|.$ This is not a strictly convex function in $\hat{y}.$ Therefore the minimum is usually not unique.

    The Wikipedia article explains this very well and lists some sources.

    – Elias Strehle Feb 24 '18 at 23:18
  • You are right; that is why people use cross-entropy nowadays for classification tasks. – Green Falcon Feb 25 '18 at 05:59
  • Absolutely! The question is about the perceptron algorithm, however. – Elias Strehle Feb 25 '18 at 09:07
  • The reason why the perceptron algorithm admits many optimal solutions is not its loss function but its activation function, which is a simple step function taking on two values. This activation function is so insensitive that you can usually wiggle around your optimal weights $w^*$ a bit without affecting its output; a small numeric sketch after this comment thread illustrates this.

    To get unique optima, you need to change the activation function. If you change it to the logistic function, you get logistic regression.

    – Elias Strehle Feb 25 '18 at 09:51
  • About logistic regression you are right, but about the activation I don't agree, unfortunately. Look, the whole purpose of an activation function is to add non-linearity, and the step function does that. The problem is that you can have different weights with the same error, because the error function for MSE is not convex. If you want to make it convex, you use cross-entropy. Moreover, the reason people don't use the perceptron is that its derivative is always zero; people used 1 for the update, which was very rigid, whereas sigmoid-like functions respond to slight changes, which enables smooth updates of the weights. – Green Falcon Feb 25 '18 at 13:29
  • I think we are working with different definitions of 'perceptron algorithm.' For me, the perceptron algorithm for classification is a single neuron with the signum function as the activation function and classification error as the loss function. This would be the plain vanilla version that is usually taught first to beginners in machine learning.

    I am aware that you can use neurons with different activation functions (sigmoid functions, etc.) and different loss functions (cross-entropy, etc.); and for those, I absolutely agree with everything you said.

    – Elias Strehle Feb 25 '18 at 14:35
  • dear @EliasStrehle I do agree :) – Green Falcon Feb 26 '18 at 08:28
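
To make the "wiggle" argument from the comments above concrete, here is a tiny NumPy illustration (the toy data, the weight vector $w^*$ and the perturbation are made up): slightly rotating an optimal weight vector leaves every prediction of the step activation unchanged, so the perturbed vectors are all optimal under the classification loss.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-2, -2], scale=0.5, size=(15, 2)),
               rng.normal(loc=[+2, +2], scale=0.5, size=(15, 2))])
y = np.array([-1] * 15 + [+1] * 15)

w_star = np.array([1.0, 1.0])                 # separates the two blobs perfectly
step = lambda s: np.where(s > 0, 1, -1)       # the perceptron's step activation

for eps in (0.0, 0.1, -0.1):
    w = w_star + eps * np.array([1.0, -1.0])  # rotate the boundary slightly
    still_perfect = bool(np.all(step(X @ w) == y))
    print(f"eps = {eps:+.1f}: perfectly separates -> {still_perfect}")
# All three weight vectors classify this toy set perfectly, so all of them are
# "optimal" under the 0-1 classification loss.
```
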
-1

Using a perceptron, you specify a cost function: Mean Squared Error for regression tasks, or perhaps Cross-Entropy for classification tasks. The input data are the constants, and the weights are the parameters of your learning problem. Once the cost function is specified, any error makes the cost non-zero, and you use algorithms like gradient descent to decrease it. This is an optimization problem in which you try to decrease the value of the error. When we say the perceptron finds the optimal point, the reason is that the cost function, e.g. MSE, is convex: there is just one optimal point, where the gradient is zero and the cost takes its least possible value. If you use neural networks, the cost with respect to the parameters (the weights) is not convex, and you usually cannot find the optimal point.
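
As a minimal sketch of the convex case (NumPy only; the linear model, data and learning rate are made up for illustration): with MSE and a single linear neuron, gradient descent started from different random points ends at the same minimizer, where the gradient is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

def mse_grad(w):
    return 2.0 / len(y) * X.T @ (X @ w - y)   # gradient of the mean squared error

for seed in (1, 2):
    w = np.random.default_rng(seed).normal(size=3)   # different starting points
    for _ in range(2000):
        w -= 0.1 * mse_grad(w)
    print(f"seed {seed}: w = {np.round(w, 3)}, "
          f"|grad| = {np.linalg.norm(mse_grad(w)):.2e}")
# Both runs end at (numerically) the same w with near-zero gradient,
# unlike the non-convex losses of deeper neural networks.
```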

I suggest looking here and here to understand more about the optimality of neural nets.

Green Falcon