Normal Equation (Method II)
Using the Normal Equation to solve for the optimal \(\ \theta \) and find the hypothesis function is an alternative to the method we talked about in the last post, and probably the easiest one, but of course it has its own benefits and drawbacks. Here the minimum of the cost function is found analytically in a single step rather than by iterating, so we don't need to select the learning rate \(\ \alpha \) or specify a number of iterations. Same as before, X is our feature matrix, and for the intercept term \(\ \theta_0 \) a column of ones should be added to X before the calculation.
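\(\ X_{m\times (n+1)} = \begin{bmatrix} \mathbf{1}_{m\times 1} & X_{m\times n} \end{bmatrix} \)
With \(\ y \) as the \(\ m\times 1 \) vector of target values, the optimal parameters are then given by \(\ \theta = (X^TX)^{-1}X^Ty \).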
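As a rough illustration, here is a minimal NumPy sketch of the closed-form solution \(\ \theta = (X^TX)^{-1}X^Ty \). The function names `add_intercept` and `normal_equation` and the toy data are made up for this example, and the sketch solves the linear system with `np.linalg.solve` rather than forming the inverse explicitly, which is a common numerical choice and not something this post prescribes.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones to an (m, n) feature matrix."""
    m = X.shape[0]
    return np.hstack([np.ones((m, 1)), X])

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y; X must already contain the ones column."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical toy data generated from y = 1 + 2x
X_raw = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

X = add_intercept(X_raw)        # shape (4, 2): ones column, then the feature
theta = normal_equation(X, y)   # theta[0] is the intercept, theta[1] the slope
print(theta)
```

No learning rate or iteration count appears anywhere: the parameters come out of a single linear solve.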
There are times when the \(\ (n+1) \times (n+1) \) matrix \(\ X^TX \) is non-invertible. The usual causes are linearly dependent (redundant) features, or too many features relative to the data, that is, when the number of training samples (m) does not exceed the number of features (n). The solution is to take the pseudo-inverse instead. Matlab and Octave provide the pinv() function for this.
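The linearly dependent case can be sketched in NumPy, whose `np.linalg.pinv` plays the same role as pinv() in Matlab and Octave. The data below is invented for illustration (the second feature is exactly twice the first), and the explicit `tol` and `rcond` values are assumptions chosen to make the toy example robust, not recommended settings.

```python
import numpy as np

# Hypothetical data: the second feature is exactly 2x the first,
# so the columns of X are linearly dependent.
X_raw = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])  # generated from y = 1 + 2 * x1

X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])  # add the ones column
gram = X.T @ X                                        # the 3 x 3 matrix X^T X

# Rank 2 for a 3 x 3 matrix means X^T X is singular (non-invertible).
rank = np.linalg.matrix_rank(gram, tol=1e-8)

# The pseudo-inverse still yields a usable theta.
theta = np.linalg.pinv(gram, rcond=1e-10) @ X.T @ y
predictions = X @ theta
print(rank, predictions)
```

The resulting \(\ \theta \) is not unique in this situation (the pseudo-inverse picks the one with the smallest norm), but the predictions still reproduce the targets.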
Taking the inverse of the \(\ (n+1) \times (n+1) \) matrix \(\ X^TX \) can be very expensive; in fact this operation alone has a time complexity of \(\ O(n^3) \). This is one of the drawbacks of using the Normal Equation, since it can be really slow when the number of features is large, for instance when \(\ n \geq 10000 \). For situations like this it's best to go for an iterative solution such as Gradient Descent.
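For comparison, here is a minimal sketch of batch gradient descent on the same kind of problem. The function name `gradient_descent`, the toy data, and the values `alpha=0.1` and `iterations=2000` are assumptions picked for this small example, not values taken from the previous post. Each iteration only needs matrix-vector products with X, so no \(\ (n+1) \times (n+1) \) matrix is ever inverted.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Batch gradient descent for linear regression; X includes the ones column."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        # Gradient of the mean squared error cost (with the usual 1/2 factor)
        gradient = X.T @ (X @ theta - y) / m
        theta -= alpha * gradient
    return theta

# Hypothetical toy data generated from y = 1 + 2x, ones column already added
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta_gd = gradient_descent(X, y)             # iterative estimate
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)  # Normal Equation, for reference
print(theta_gd, theta_ne)
```

On a problem this small both approaches agree closely and the Normal Equation is simpler; the trade-off only tips towards the iterative method as n grows.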