My book describes this as an equation for minimizing the $\theta$ value, but I have a few questions regarding the intuition behind this equation:
The book describes $j$ as the number of features. If we have to compute the $\theta$ value for every $j$, does this mean that the number of features $(x_1, x_2, \dots)$ is equal to the number of parameters $(\theta_1, \theta_2, \dots)$?
How are the initial $\theta$ and $\alpha$ values selected? What happens if the selected values are too low or too high?
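To make the questions concrete, here is a minimal sketch of how I currently understand the update. I am assuming the equation is the batch gradient descent rule for linear regression with an extra intercept term $\theta_0$; that is my assumption, not something I am sure the book intends. The function names `gradient_descent` and `cost`, the toy data, and the particular $\alpha$ values are all made up for illustration.

```python
def cost(xs, ys, theta):
    """Mean squared error cost J(theta) = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    total = 0.0
    for row, y in zip(xs, ys):
        h = theta[0] + sum(t * x for t, x in zip(theta[1:], row))
        total += (h - y) ** 2
    return total / (2 * m)


def gradient_descent(xs, ys, alpha, steps, theta=None):
    """Batch gradient descent, assuming h(x) = theta_0 + sum_j theta_j * x_j."""
    m = len(xs)       # number of training examples
    n = len(xs[0])    # number of features
    if theta is None:
        # Assumed starting point: all zeros, one theta per feature plus theta_0.
        theta = [0.0] * (n + 1)
    for _ in range(steps):
        # Prediction error h(x_i) - y_i for every training example.
        errors = [
            theta[0] + sum(t * x for t, x in zip(theta[1:], row)) - y
            for row, y in zip(xs, ys)
        ]
        # Partial derivative of the cost with respect to each theta_j.
        grad = [sum(errors) / m]
        for j in range(n):
            grad.append(sum(e * row[j] for e, row in zip(errors, xs)) / m)
        # Update every theta_j simultaneously, using the same old theta.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Toy data generated from y = 1 + 2*x1 + 3*x2 (two features, three parameters).
xs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [1.0, 3.0, 4.0, 6.0]

initial_cost = cost(xs, ys, [0.0, 0.0, 0.0])
theta_good = gradient_descent(xs, ys, alpha=0.1, steps=5000)
theta_small = gradient_descent(xs, ys, alpha=1e-4, steps=100)
theta_large = gradient_descent(xs, ys, alpha=2.0, steps=50)

print("moderate alpha:", theta_good, cost(xs, ys, theta_good))
print("tiny alpha:    ", theta_small, cost(xs, ys, theta_small))
print("large alpha:   ", theta_large, cost(xs, ys, theta_large))
```

If this reading is right, two features give three parameters because of $\theta_0$. On this toy data the tiny $\alpha$ barely moves away from the starting point, and the large $\alpha$ makes the cost grow instead of shrink. I am not sure whether that matches what the book means.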
If anyone could clear up my confusion, that would be great. Thanks.
Please refer to page 10.
– someman112 Jun 27 '23 at 00:42