
I have done linear and multivariate regression, so I understand what the probability, cost, and gradient descent functions are. I do not understand what the last three formulas mean or how they relate to each other. I don't expect anyone to explain them to me, but could someone please share a link that explains them in simple terms? I've tried looking online, but the formulas used often look different from the ones below.

Sorry for posting a picture like this but I don't know how to format formulas.

Anya

2 Answers


Logistic regression is a rather hard algorithm to digest immediately, because the details are often abstracted away for the sake of simplicity for practitioners. To explain the idea behind logistic regression as a probabilistic model, we need to introduce the odds ratio, i.e. the odds in favor of a particular event. The odds ratio can be written $\frac{p}{1-p}$, where $p$ stands for the probability of the positive event. We can then define the logit function, which is simply the logarithm of the odds ratio (the log-odds). The logit function takes input values in the range $(0, 1)$ and transforms them to values over the entire real number line, which lets us express a linear relationship between feature values and the log-odds:

$\operatorname{logit}(p) = \log\frac{p}{1-p} = w_{0}x_{0} + w_{1}x_{1} + \dots + w_{n}x_{n} = w^{T}x \qquad (1)$
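As a side note, here is a minimal NumPy sketch of the odds and the logit just for intuition (the function names and example values are mine, not from the question):

```python
import numpy as np

def odds(p):
    # odds in favor of the positive event
    return p / (1.0 - p)

def logit(p):
    # log-odds: maps probabilities in (0, 1) onto the whole real line
    return np.log(odds(p))

print(logit(np.array([0.2, 0.5, 0.8])))  # roughly [-1.386, 0.0, 1.386]
```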

In logistic regression we are interested in predicting the probability that a certain sample belongs to a particular class, which we get from the inverse of the logit function. This inverse is called the logistic function, often simply called the sigmoid function:

$p(Y=1|x) = \phi(z) = \frac{1}{1+e^{-z}}, \quad \text{where } z = w^{T}x \qquad (2)$

$p(Y=0|x) = 1 - p(Y=1|x) \qquad (3)$
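In code, eqs. (2) and (3) might look like the following minimal sketch (the weights and features are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # logistic function: maps any real z to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.0, 2.0])  # made-up weights, w0 acting as the bias
x = np.array([1.0, 0.3, 0.7])   # made-up feature vector with x0 = 1

z = w @ x             # net input z = w^T x
p_y1 = sigmoid(z)     # p(Y=1 | x), eq. (2)
p_y0 = 1.0 - p_y1     # p(Y=0 | x), eq. (3)
```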

Here comes the tricky part: the estimation of the parameters $w$ (see eq. 1). The solution to a logistic regression problem is the set of parameters $w$ that maximizes the likelihood, i.e. the maximum likelihood estimate (keep in mind that we are going to derive a loss function from this). Because logistic regression predicts probabilities rather than just classes, we can fit it using the likelihood:

$L(w) = P(Y|w;x) = \prod\limits_{i=1}^{n} \phi(z_{i})^{Y_{i}}(1-\phi(z_{i}))^{1-Y_{i}}, \quad \text{with } z_{i} = w^{T}x_{i} \qquad (4)$
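Eq. (4) is just the product, over all samples, of the predicted probability of each sample's true label. A rough sketch (shapes and names are my own assumptions):

```python
import numpy as np

def likelihood(w, X, y):
    # X: (n, d) feature matrix, y: (n,) labels in {0, 1}
    phi = 1.0 / (1.0 + np.exp(-(X @ w)))               # phi(z_i) for every sample
    return np.prod(phi ** y * (1.0 - phi) ** (1 - y))  # eq. (4)
```

Multiplying many probabilities quickly underflows numerically, which is one more reason the logarithm is taken next.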

In practice, instead of maximizing this likelihood, it is easier to maximize the (natural) log of this equation, which is called the log-likelihood function:

$\log L(w) = \sum\limits_{i=1}^{n} \left[ Y_{i}\log\phi(z_{i}) + (1-Y_{i})\log(1-\phi(z_{i})) \right] \qquad (5)$

Or to make it mathematically more convenient, we take the negative of this function, so that it can be minimized using gradient descent:

$\text{cost} = J(w) = -\log L(w) = \sum\limits_{i=1}^{n} \left[ -Y_{i}\log\phi(z_{i}) - (1-Y_{i})\log(1-\phi(z_{i})) \right] \qquad (6)$
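A minimal sketch of eq. (6), reusing the same made-up shapes as above (nothing here comes from the question itself):

```python
import numpy as np

def cost(w, X, y):
    # negative log-likelihood, eq. (6); minimizing it maximizes eqs. (4)/(5)
    phi = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -np.sum(y * np.log(phi) + (1.0 - y) * np.log(1.0 - phi))
```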

A plot of the cost is shown in the figure from the source linked here (image not reproduced).

We then take the derivative with respect to the parameters $w$ (see eq. 1), rather than with respect to $z$ (in your photo the derivative is taken with respect to $\alpha$). Those derivatives come out like the ones you showed. I still recommend checking out this blog post, which goes through these steps in more detail, or the partial implementation in Python here, if needed.
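For completeness, a hedged sketch of that last step: for the cost in eq. (6) the gradient with respect to $w$ works out to $X^{T}(\phi(Xw) - y)$, so one plain batch gradient-descent loop could look roughly like this (learning rate and iteration count are arbitrary choices, not from the question):

```python
import numpy as np

def fit_logistic(X, y, lr=0.01, n_iter=1000):
    # plain batch gradient descent on the cost in eq. (6)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        phi = 1.0 / (1.0 + np.exp(-(X @ w)))  # predicted probabilities phi(z_i)
        grad = X.T @ (phi - y)                # gradient of eq. (6) w.r.t. w
        w -= lr * grad                        # step against the gradient
    return w
```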

TwinPenguins

I suggest starting from the link here.

Felix Tenn