
I am currently trying to implement logistic regression with iteratively reweighted least squares, following "Pattern Recognition and Machine Learning" by C. Bishop. In a first approach I implemented it in C#, using Gaussian elimination to solve eq. 4.99. For a single feature it gave very promising (nearly exact) results, but whenever I ran it with more than one feature my system matrix became singular and the weights did not converge. At first I thought it was my implementation, but when I reimplemented it in Scilab the same thing happened.
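For reference, my reading of eq. 4.99 (the Newton–Raphson step rewritten as a weighted least-squares problem) is

$$ w_{\text{new}} = (\Phi^{T} R \Phi)^{-1} \Phi^{T} R z, \qquad z = \Phi w_{\text{old}} - R^{-1}(y - t), $$

where $\Phi$ is the design matrix, $y_n = \sigma(w^{T}\phi_n)$ are the predicted probabilities, $t$ are the targets, and $R$ is the diagonal matrix with $R_{nn} = y_n(1 - y_n)$.

The Scilab code (more concise thanks to its matrix operators) I used is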

phi = [1; 0; 1; 1];    // design matrix: one row per sample, one feature
t   = [1; 0; 0; 0];    // target classes
w   = [1];             // initial weight

w' * phi(1,:)'         // sanity check: linear predictor of the first sample

n = size(phi, 1);
for in = 1:100
    y     = zeros(n, 1);    // predicted probabilities y_i = sigma(w' * phi_i)
    R     = zeros(n, n);    // weighting matrix, R_ii = y_i * (1 - y_i)
    R_inv = zeros(n, n);

    for i = 1:n
        y(i)       = 1 / (1 + exp(-(w' * phi(i,:)')));
        R(i,i)     = y(i) * (1 - y(i));
        R_inv(i,i) = 1 / R(i,i);
    end

    // IRLS update (eq. 4.99); left without semicolons so each iteration prints
    z = phi * w - R_inv * (y - t)
    w = inv(phi' * R * phi) * phi' * R * z
end
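To read the predicted probabilities off the fitted weights afterwards, one can push all the activations through the sigmoid in one go, e.g. (a quick sketch):

p = ones(t) ./ (1 + exp(-phi * w))    // predicted P(class 1) for each sample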

With the values above for phi (input/features) and t (output/classes), it yields a weight of -0.6931472, which is -ln 2, so the predicted probability is sigma(-ln 2) = 1/(1 + 2) = 1/3. That seems fine to me, since 1/3 of the samples with feature 1 present belong to class 1 (please forgive me if my terms do not comply with ML language completely; I am a software developer). If I now add an intercept feature, which corresponds to

phi = [1, 1; 1, 0; 1, 1; 1, 1];
w = [1; 1];

my R matrix becomes singular and the final value of the weights is

w  =
  - 5.8151677  
  1.290D+30  

which, to my reading, would mean that the probability of belonging to class 1 is essentially 1 if feature 1 is present and about 0.3% otherwise. There must be some error I have made, but I do not see which one. Since both implementations yield the same results, I suspect I have missed or misunderstood some point, but I cannot tell which one.
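(As a quick check of that reading, with sigma(a) = 1/(1 + exp(-a)): for a sample with feature 1 present the activation is -5.8151677 + 1.290e30, so its sigma is numerically 1; with the feature absent it is sigma(-5.8151677), which is roughly 0.003.)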

Paul K
  • This looks more like a programming question or maybe a stats question. It's not a data science question. Try StackOverflow? – Spacedman Sep 12 '14 at 13:10
    I was not that clear about it either. I think it'll be quite hard for someone who's not into the topic to see the problem, but I may give it a try – Paul K Sep 12 '14 at 13:40
    That's why I suggested stats and programming. Those would be the people into the topic. All you've got here are data miners and hadoop word-counters. – Spacedman Sep 14 '14 at 08:52

1 Answer


This has been asked, and answered, on CrossValidated. See https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression and https://stats.stackexchange.com/questions/45803/logistic-regression-in-r-resulted-in-hauck-donner-phenomenon-now-what for two related answers (and there are other related questions that you can explore there). That logistic regression can blow up is a known effect in computational statistics.
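Concretely, with the data from the question (a quick sanity check, writing $a = w_0 + w_1$ for the activation of the samples with the feature present and $b = w_0$ for the single sample without it), the log-likelihood splits as

$$ \ln \sigma(a) + 2\ln\bigl(1 - \sigma(a)\bigr) + \ln\bigl(1 - \sigma(b)\bigr), $$

which is maximised at $\sigma(a) = 1/3$ but keeps increasing as $b \to -\infty$; the maximum is only reached in the limit $w_0 \to -\infty$, $w_1 \to +\infty$. That is exactly the kind of (quasi-)separation discussed in those links, and it is why the IRLS iterates diverge.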

Also, in situations where quantities like exp(-30) come within a few hundred times the relative accuracy of the double type (about 2.2e-16), you have to be extremely careful about accumulated round-off: adding exp(-30) to a quantity of order 1 wipes out most of its significant digits, so summing the likelihood contributions can run into numerical problems, especially once you start computing numerical derivatives and gradients. For a brief introduction to these issues as they show up in typical statistical computing problems, see http://www.stata.com/meeting/nordic-and-baltic14/abstracts/materials/dk14_gould.pdf.
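To see the loss of precision concretely, a minimal Scilab sketch (my own illustration, not from the linked slides):

d = exp(-30)                  // about 9.4e-14
(1 + d) - 1                   // no longer exactly d: low-order digits are lost
abs(((1 + d) - 1) - d) / d    // relative error from one addition, roughly 1e-3 in IEEE double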

StasK
  • Thank you very much, I'll definitely have a look at it for educational purposes. I fortunately found a ready-made implementation for my system which works for me. – Paul K Sep 23 '14 at 06:37