Cost Function (Method I)
To calculate the cost in linear regression, we typically use the Sum of Squared Errors (SSE) method:
\(\ J(\theta) = \frac{1}{2m}\sum_{i=1}^m{(h(x^{(i)})-{y^{(i)}})^2}\)
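As a quick sketch, this cost can be computed in a few lines of NumPy (the `compute_cost` helper and the data below are illustrative, not from any particular library):

```python
import numpy as np

def compute_cost(X, y, theta):
    """SSE cost J(theta) = (1/2m) * sum((h(x^(i)) - y^(i))^2)."""
    m = len(y)                       # number of training samples
    errors = X @ theta - y           # h(x^(i)) - y^(i) for every sample
    return (errors @ errors) / (2 * m)

# Design matrix with x0 = 1 in the first column; targets follow y = 1 + 2x
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])

print(compute_cost(X, y, np.array([1.0, 2.0])))  # 0.0  (perfect fit)
print(compute_cost(X, y, np.array([0.0, 0.0])))  # 83/6, about 13.83
```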
The goal is to minimize \(\ J(\theta)\) and figure out \(\ \theta\) values corresponding to the minimum cost. There are several optimization algorithms used to achieve this.
- Gradient Descent
- Conjugate Gradient Method
- Broyden–Fletcher–Goldfarb–Shanno Algorithm (BFGS)
- Limited-memory BFGS (L-BFGS)
We'll briefly discuss the simplest of these: the Gradient Descent algorithm.
Algorithm
Repeat until convergence {
\(\ \theta_j := \theta_j - \alpha\frac{\partial J(\theta)}{\partial \theta_j} = \theta_j - \frac{\alpha}{m}\sum_{i=1}^m{(h(x^{(i)})-y^{(i)})x_j^{(i)}} \) for \( j = 0,\dots,n \)
}
\(\ \alpha \) - Learning rate
\(\ m \) - Number of training samples
How it works
The cost function for linear regression is always a convex function, so it has a single global minimum and no local minima to get stuck in. Consider the below graph of \(\ J(\theta)\) plotted against \(\ \theta_1\).
The size of each step depends on \(\ \alpha \), the learning rate. If \(\ \alpha \) is too small, convergence can be slow; if \(\ \alpha \) is too large, the update can overshoot the minimum and fail to converge. As illustrated in the graph, when gradient descent gets closer to the minimum the gradient shrinks, so the step size decreases automatically even with a fixed \(\ \alpha \). By keeping track of \(\ J(\theta)\) after each iteration, we can always observe whether it is converging or diverging. One important point to remember: when updating \(\ \theta\) in each iteration, all components must be updated simultaneously. Take a look at the example and the pseudocode below to clarify.
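To see this behaviour concretely, here is a minimal sketch (hypothetical one-feature data) that tracks \(\ J(\theta)\) after every iteration for a small and a large \(\ \alpha \):

```python
import numpy as np

# Hypothetical one-feature dataset generated from y = 2x
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # first column is x0 = 1
y = np.array([2.0, 4.0, 6.0])
m = len(y)

def cost(theta):
    e = X @ theta - y
    return (e @ e) / (2 * m)

def run(alpha, iters=50):
    theta = np.zeros(2)                        # start from the zero vector
    history = []
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m       # all partial derivatives at once
        theta = theta - alpha * grad           # simultaneous update of theta_0, theta_1
        history.append(cost(theta))
    return history

h_small, h_large = run(alpha=0.1), run(alpha=1.5)
print(h_small[-1] < h_small[0])   # True: cost shrinks, converging
print(h_large[-1] > h_large[0])   # True: cost grows, diverging
```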
Example:
Consider the case where we have two feature variables \(\ x_1 \) and \(\ x_2 \). The general hypothesis function would be \(\ h(x) = \theta_0+ \theta_1x_1+ \theta_2x_2 \). Then we need the partial derivatives of \(\ J(\theta)\) with respect to each \(\ \theta_j\), i.e. \(\ \frac{\partial J(\theta)}{\partial \theta_j} \).
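Differentiating the cost with respect to each parameter (and using \(x_0^{(i)} = 1\) for the bias term) gives the gradients that the update rule needs:

\(\ \frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^m{(h(x^{(i)})-y^{(i)})x_j^{(i)}} \)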
\(X = \begin{bmatrix}x_0 \\ x_1 \\ x_2 \end{bmatrix}\) where \(x_0 = 1 \)
\(\ \theta = \begin{bmatrix}\theta_0 \\ \theta_1 \\ \theta_2 \end{bmatrix}\) - Initial theta vector (typically initialized as a zero vector)
N - number of iterations
function gradientDescent(\(\ X , y, \theta, \alpha, N \)):
for 1 to N:
\(\ temp\_\theta_0 = \theta_0 - \frac{\alpha}{m}\sum_{i=1}^m{(h(x^{(i)})-{y^{(i)}})}\);
\(\ temp\_\theta_1 = \theta_1 - \frac{\alpha}{m}\sum_{i=1}^m{(h(x^{(i)})-{y^{(i)}})\cdot x_1^{(i)}}\);
\(\ temp\_\theta_2 = \theta_2 - \frac{\alpha}{m}\sum_{i=1}^m{(h(x^{(i)})-{y^{(i)}})\cdot x_2^{(i)}}\);
\(\ \theta_0 = temp\_\theta_0 \);
\(\ \theta_1 = temp\_\theta_1 \);
\(\ \theta_2 = temp\_\theta_2 \);
end
return \(\ \theta\);
end
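The pseudocode above can be sketched as a runnable, vectorized NumPy function (the dataset here is illustrative):

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, N):
    """Batch gradient descent for linear regression.

    Each iteration applies theta_j := theta_j - (alpha/m) * sum((h(x^(i)) - y^(i)) * x_j^(i));
    the single vectorized assignment updates every theta_j simultaneously.
    """
    m = len(y)
    for _ in range(N):
        errors = X @ theta - y                   # h(x^(i)) - y^(i)
        theta = theta - (alpha / m) * (X.T @ errors)
    return theta

# Two-feature example: x0 = 1 (bias), so X has three columns
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 2.0]])
y = X @ np.array([1.0, 2.0, 3.0])                # targets built from a known theta
theta = gradient_descent(X, y, np.zeros(3), alpha=0.05, N=10000)
print(np.round(theta, 2))                        # [1. 2. 3.]
```

Because `X.T @ errors` computes all the sums at once, the temporary variables from the pseudocode are unnecessary: the whole vector \(\ \theta\) is replaced in one step, which is exactly the simultaneous update the algorithm requires.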
NEXT: Normal Equation