Introduction to Linear Regression: Cost Function

October 29, 2014

Cost Function (Method I)

To calculate the cost in linear regression, we typically use the Sum of Squared Errors (SSE) method:

\(\ J(\theta) = \frac{1}{2m}\sum_{i=1}^m{(h(x^{(i)})-{y^{(i)}})^2}\)
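As a quick illustration, this cost can be computed directly; here is a minimal sketch in Python (the `compute_cost` name and the tiny dataset are just for demonstration, not part of the post):

```python
import numpy as np

def compute_cost(X, y, theta):
    """SSE cost J(theta) = 1/(2m) * sum((h(x^(i)) - y^(i))^2)."""
    m = len(y)
    residuals = X @ theta - y        # h(x^(i)) - y^(i) for every sample
    return (residuals @ residuals) / (2 * m)

# X already includes the bias column x_0 = 1
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(compute_cost(X, y, np.zeros(2)))   # cost with theta = [0, 0] -> 14/6
```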

The goal is to minimize \(\ J(\theta)\) and figure out \(\ \theta\) values corresponding to the minimum cost. There are several optimization algorithms used to achieve this.

  • Gradient Descent
  • Conjugate Gradient Method
  • Broyden–Fletcher–Goldfarb–Shanno Algorithm (BFGS)
  • Limited-memory BFGS (L-BFGS)
We'll briefly discuss the simplest of them: the Gradient Descent algorithm.

Algorithm
Repeat until convergence{
         \(\ \theta_j := \theta_j -\alpha\frac{\partial J(\theta)}{\partial \theta_j} \) for \( j = 0,\dots,n \)
}
\(\ \alpha  \) - Learning rate
\(\ m  \) - Number of training samples

How it works

The cost function for linear regression is always a convex function, hence it has only a global minimum and no local minima. Consider the graph below of \(\ J(\theta)\) plotted against \(\ \theta_1\).
The size of each step depends on \(\ \alpha  \), the learning rate. If \(\ \alpha  \) is too small, convergence can be slow; if \(\ \alpha  \) is too large, gradient descent can overshoot the minimum and fail to converge. As illustrated in the graph, the step size automatically decreases as gradient descent gets closer to the minimum, because the gradient itself shrinks — there is no need to decrease \(\ \alpha  \) over time. By keeping track of \(\ J(\theta)\) after each iteration, we can always observe whether it is converging or diverging. One important point to remember: when updating \(\ \theta\) in each iteration, the update must be done simultaneously for every \(\ \theta_j\). Take a look at the example and the pseudocode below for clarification.
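The effect of the learning rate is easy to observe numerically. The sketch below (illustrative one-feature data, not from the post) tracks \(J(\theta)\) per iteration for a small and a large \(\alpha\): the first converges steadily, the second overshoots and diverges:

```python
import numpy as np

# One-feature data generated from y = 2*x (bias column x_0 = 1 included)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
m = len(y)

def cost(theta):
    r = X @ theta - y
    return (r @ r) / (2 * m)

def run(alpha, iters=50):
    theta = np.zeros(2)
    history = []
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # all theta_j updated simultaneously
        theta = theta - alpha * grad
        history.append(cost(theta))
    return history

small = run(alpha=0.1)   # cost shrinks toward the minimum
large = run(alpha=1.5)   # cost grows: alpha too large, overshoots each step
print(small[0], small[-1])
print(large[0], large[-1])
```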

Example:
Consider the case where we have two feature variables \(\ x_1 \) and \(\ x_2 \). The general hypothesis function would be \(\ h(x) = \theta_0+ \theta_1x_1+ \theta_2x_2 \). We then need the partial derivatives of \(\ J(\theta)\) with respect to each \(\ \theta_j\), i.e. \(\ \frac{\partial J(\theta)}{\partial \theta_j} \).
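Carrying out the differentiation of \(\ J(\theta)\) for this hypothesis gives (with the convention \(\ x_0^{(i)} = 1 \), the first line is just the \( j = 0 \) case of the second):

\(\ \frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^m{(h(x^{(i)})-y^{(i)})} \)

\(\ \frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^m{(h(x^{(i)})-y^{(i)})\,x_j^{(i)}} \) for \( j = 1, 2 \)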

\(X = \begin{bmatrix}x_0  \\ x_1 \\ x_2 \end{bmatrix}\) where \(x_0 = 1 \)

\(\ \theta = \begin{bmatrix}\theta_0 \\ \theta_1 \\ \theta_2 \end{bmatrix}\) - Initial theta vector (typically initialized as a zero vector)
N - number of iterations

function gradientDescent(\(\ X , y,  \theta, \alpha, N \)):
     for 1 to N:
        \(\ temp\_\theta_0 = \theta_0 - \frac{\alpha}{m}\sum_{i=1}^m{(h(x^{(i)})-y^{(i)})}\);
        \(\ temp\_\theta_1 = \theta_1 - \frac{\alpha}{m}\sum_{i=1}^m{(h(x^{(i)})-y^{(i)})\,x_1^{(i)}}\);
        \(\ temp\_\theta_2 = \theta_2 - \frac{\alpha}{m}\sum_{i=1}^m{(h(x^{(i)})-y^{(i)})\,x_2^{(i)}}\);
        \(\ \theta_0 = temp\_\theta_0 \);
        \(\ \theta_1 = temp\_\theta_1 \);
        \(\ \theta_2 = temp\_\theta_2 \);
     end
     return \(\ \theta\);
end
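The pseudocode above translates directly into Python. Below is a sketch using a vectorized update, which performs the same simultaneous update of all \(\theta_j\) in one matrix operation (the dataset, generated from \(y = 1 + x_1 + 2x_2\), is made up for demonstration):

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, N):
    """N iterations of simultaneous gradient descent updates."""
    m = len(y)
    for _ in range(N):
        errors = X @ theta - y                 # h(x^(i)) - y^(i) for all samples
        theta = theta - (alpha / m) * (X.T @ errors)  # all theta_j at once
    return theta

# Two features plus the bias column x_0 = 1 (illustrative data)
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 2.0]])
y = np.array([6.0, 5.0, 10.0, 9.0])   # exactly 1 + x_1 + 2*x_2 for each row
theta = gradient_descent(X, y, np.zeros(3), alpha=0.05, N=5000)
print(theta)   # approaches [1, 1, 2]
```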

NEXT: Normal Equation
