Gradient Descent: A Path to the Minimum
Gradient Descent is an iterative optimization algorithm that aims to find the minimum of a function. It operates on the principle of repeatedly taking steps in the direction of the negative gradient (the direction of steepest descent) of the function at the current point.
The Intuition
Imagine you're standing on a hillside and want to find the lowest point. You might look around and identify the direction of the steepest slope downhill and take a step in that direction. You'd repeat this process, gradually moving towards the valley floor (the minimum). Gradient Descent does something similar in a multi-dimensional space defined by the function's parameters.
The Math
Let's formalize this intuition with some math:
Cost Function (J(θ)): This is the function we want to minimize. It represents the error or discrepancy between our model's predictions and the actual data.
Gradient (∇J(θ)): This is a vector that points in the direction of the steepest ascent of the cost function at a particular point θ. It's calculated by taking the partial derivatives of the cost function with respect to each parameter.
Update Rule: Gradient Descent iteratively updates the parameters using the following rule:
θt+1=θt−α∇J(θt)Where:
θt: The current values of the parameters at iteration t.
θt+1: The updated values of the parameters for the next iteration.
α: The learning rate, which controls the step size taken in the direction of the negative gradient.
∇J(θt): The gradient of the cost function evaluated at the current parameter values.
Why it Works
The key to understanding why Gradient Descent works lies in the negative gradient:
Negative Gradient (-∇J(θ)): This vector points in the direction of the steepest descent of the cost function. By taking a step in this direction, we move towards lower values of the cost function, closer to the minimum.
Learning Rate (α): The learning rate determines the size of the step we take. A smaller learning rate results in smaller steps, offering finer control but potentially requiring more iterations. A larger learning rate leads to bigger steps, potentially converging faster but risking overshooting the minimum.
Convergence
Gradient Descent continues iteratively applying the update rule until it reaches a point where the gradient is very close to zero or a predefined number of iterations is completed. At this point, it's considered to have converged to a local minimum of the cost function.
In essence, Gradient Descent works because it systematically follows the direction of steepest descent, guided by the negative gradient, gradually moving towards a minimum of the function.
In this example, we will demonstrate the use of gradient descent to find a local minimum of the Himmelblau's function, which is a classic example of a non-convex optimization problem. Since we already have the Himmelbau's function's explicit analytical form. We can get its explicit gradient functions.
- Himmelblau's Function:
The Himmelblau's function is defined as:
f(x,y)=(x2+y−11)2+(x+y2−7)2This function has four local minima.
- Gradient of Himmelblau's Function:
The gradient of the Himmelblau's function is a vector of its partial derivatives with respect to x and y:
∇f(x,y)=[∂f∂x∂f∂y]=[4x(x2+y−11)+2(x+y2−7)2(x2+y−11)+4y(x+y2−7)]The code calculates this gradient in the himmelblau_gradient function.
- Gradient Descent Algorithm:
Gradient descent is an iterative optimization algorithm that aims to find a local minimum of a function. It works by repeatedly taking steps in the opposite direction of the gradient. The update rule for gradient descent is:
xt+1=xt−α∇f(xt)where:
xt is the current point at iteration t. α is the learning rate, which controls the step size. ∇f(xt) is the gradient of the function at the current point. The code implements this algorithm in the gradient_descent function.
- Visualization:
The code uses matplotlib to visualize the optimization process:
It plots the contour lines of the Himmelblau's function, which represent points of equal function value. It plots the optimization path, showing how the algorithm moves from the starting point to the final point. It marks the start and end points with different colors. In essence, the code performs the following steps:
Defines the Himmelblau's function and its gradient. Implements the gradient descent algorithm. Sets the starting point, learning rate, and number of iterations. Runs the gradient descent algorithm to find a local minimum. Visualizes the results.
# Use gradient descent to optimize non-convex optimization problems
import numpy as np
import matplotlib.pyplot as plt
# Define the Himmelblau's function (a non-convex function)
def himmelblau(x):
return (x[0]**2 + x[1] - 11)**2 + (x[0] + x[1]**2 - 7)**2
# Gradient of Himmelblau's function
def himmelblau_gradient(x):
grad = np.zeros(2)
grad[0] = 4*x[0]*(x[0]**2 + x[1] - 11) + 2*(x[0] + x[1]**2 - 7)
grad[1] = 2*(x[0]**2 + x[1] - 11) + 4*x[1]*(x[0] + x[1]**2 - 7)
return grad
# Gradient Descent
def gradient_descent(start_point, learning_rate, iterations):
x = np.array(start_point, dtype=float)
history = [x.copy()] # Store the optimization path
for i in range(iterations):
grad = himmelblau_gradient(x)
x -= learning_rate * grad
history.append(x.copy())
return np.array(history)
# Parameters
start_point = [-4, 4] # Initial point
learning_rate = 0.01
iterations = 500
# Run Gradient Descent
optimization_path = gradient_descent(start_point, learning_rate, iterations)
# Plot the results
x = np.arange(-6, 6, 0.1)
y = np.arange(-6, 6, 0.1)
X, Y = np.meshgrid(x, y)
Z = himmelblau([X, Y])
plt.figure(figsize=(10, 6))
plt.contour(X, Y, Z, 50) # Plot contours of the Himmelblau function
plt.plot(optimization_path[:, 0], optimization_path[:, 1], marker='o', markersize=3, linestyle='-')
plt.scatter(optimization_path[0, 0], optimization_path[0, 1], color='red', label='Start')
plt.scatter(optimization_path[-1, 0], optimization_path[-1, 1], color='green', label='End') #Final point
plt.xlabel('x')
plt.ylabel('y')
plt.title('Gradient Descent on Himmelblau\'s Function')
plt.legend()
plt.grid(True)
plt.show()
print(f"Final Point: {optimization_path[-1]}")
print(f"Himmelblau's value at final point: {himmelblau(optimization_path[-1])}")
Final Point: [-2.80511809 3.13131252] Himmelblau's value at final point: 7.888609052210118e-31
In the previous example, given a cost function f(a,b)=(a2+b−11)2+(a+b2−7)2, we are able to find its minimum with gradient descent.
Next, we demonstrate another important application of gradient descent technique. This time, the cost function will be J(θ0,θ1)=12mm∑i=1(θ0+θ1X(i)−y(i))2, finding its minimum value with gradient descent is the so called least square regression.
In the following example, we have a set of samples (xi, yi), our goal is to find a line that best fits the sample points. With the sample points [(1,2), (2,4), (3,5), (4,4), (5,5)], the cost function is:
f(a,b)=12⋅5[(a+b−2)2+(a+2b−4)2+(a+3b−5)2+(a+4b−4)2+(a+5b−5)2]=12⋅5(5a2+30ab−40a+55b2−132b+86)We can choose to solve a, b from the df/da= 0 and df/db = 0. However, when the set of samples are large, simplify f(a,b) then solve a, b exactly will be computationally expensive.
Instead, we apply gredient descent to solve a, b.
- The Model:
We're assuming a linear relationship between the input variable X and the output variable y. The model is represented as:
ˆy=θ0+θ1Xwhere:
ˆy is the predicted value of y.
θ0 is the intercept (bias term).
θ1 is the slope (weight of X).
X is the input variable.
- The Cost Function:
We want to minimize the difference between the predicted values (ˆy) and the actual values (y). We use the Mean Squared Error (MSE) as the cost function:
J(θ0,θ1)=12mm∑i=1(ˆy(i)−y(i))2where:
J(θ0,θ1) is the cost function.
m is the number of data points.
ˆy(i) is the prediction for the i-th data point.
y(i) is the actual value for the i-th data point.
- Gradient Descent:
The gradient descent algorithm iteratively updates the parameters (θ0, θ1) to minimize the cost function. The update rules are:
θ0:=θ0−α∂J(θ0,θ1)∂θ0θ1:=θ1−α∂J(θ0,θ1)∂θ1where:
α is the learning rate (controls the step size). ∂J(θ0,θ1)∂θ0 and ∂J(θ0,θ1)∂θ1 are the partial derivatives of the cost function with respect to θ0 and θ1, respectively.
- Partial Derivatives:
The partial derivatives are calculated as follows:
∂J(θ0,θ1)∂θ0=1mm∑i=1(ˆy(i)−y(i))∂J(θ0,θ1)∂θ1=1mm∑i=1(ˆy(i)−y(i))X(i)In essence:
We start with initial values for θ0 and θ1. We repeatedly apply the gradient descent update rules, calculating the partial derivatives and adjusting the parameters. We continue this process until the cost function converges to a minimum or a predefined number of iterations is reached.
# let's solve basic 2 dimensional least square regression
import numpy as np
import matplotlib.pyplot as plt
# Sample data (replace with your actual data)
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
# Least squares regression using gradient descent
def least_squares_gd(X, y, learning_rate=0.01, iterations=1000, threshold=1e-6):
m = len(X)
theta = np.random.randn(2) # Initialize theta (intercept and slope)
history = [theta.copy()]
errors = []
for i in range(iterations):
# Predict y values
y_pred = theta[0] + theta[1] * X
# Calculate gradients
grad_theta0 = (1/m) * np.sum(y_pred - y)
grad_theta1 = (1/m) * np.sum((y_pred - y) * X)
gradients = np.array([grad_theta0, grad_theta1])
# Update theta using gradient descent
theta -= learning_rate * gradients
history.append(theta.copy())
#Calculate error
error = np.mean((y_pred - y)**2)
errors.append(error)
#Check for convergence
if i > 0 and abs(errors[-1] - errors[-2]) < threshold:
print(f"Converged at iteration {i}")
break
return theta, np.array(history), errors
# Perform gradient descent
# theta, theta_history, errors = least_squares_gd(X, y, 0.01, 10000)
#decrease learning rate, threshold will give better solution, but more expensive
theta, theta_history, errors = least_squares_gd(X, y, 0.001, 100000, 1e-10)
# Make predictions
y_pred = theta[0] + theta[1] * X
# Plot the results
plt.figure(figsize=(8, 6))
plt.scatter(X, y, label='Sample Data')
plt.plot(X, y_pred, color='red', label='Regression Line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Least Squares Regression with Gradient Descent')
plt.legend()
plt.grid(True)
# Plot the optimization path
plt.figure(figsize=(8, 6))
plt.plot(theta_history[:, 0], theta_history[:, 1], marker='o', markersize=3, linestyle='-')
plt.scatter(theta_history[0, 0], theta_history[0, 1], color='red', label='Start')
plt.scatter(theta_history[-1, 0], theta_history[-1, 1], color='green', label='End') #Final point
plt.xlabel("Theta0")
plt.ylabel("Theta1")
plt.title("Optimization Path")
plt.legend()
#Plot the error
plt.figure(figsize=(8,6))
plt.plot(errors)
plt.xlabel("Iteration")
plt.ylabel("Error")
plt.title("Error vs Iteration")
plt.show()
print(f"Intercept (theta0): {theta[0]}")
print(f"Slope (theta1): {theta[1]}")
Converged at iteration 41782
Intercept (theta0): 2.198725725318917 Slope (theta1): 0.6003529536535999
# using sympy , given x = np.array([1, 2, 3, 4, 5]) y = np.array([2, 4, 5, 4, 5]),
# xi, yi are dots in x, y, calculate sum of symbol (a + b*xi - yi)^2 for xi, yi, simplify it
import sympy
import numpy as np
# Define symbols
a, b = sympy.symbols('a b')
# Given data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
# Calculate the sum of squared differences
sum_of_squares = 0
for xi, yi in zip(x, y):
sum_of_squares += (a + b * xi - yi)**2
# Divide by 10 and simplify
simplified_expression = (sum_of_squares / 10).simplify()
simplified_expression
# given the above f(a, b) expression, solve a, b for df/da=0, df/db = 0
import numpy as np
import sympy
# Define symbols
a, b = sympy.symbols('a b')
# Given data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
# Calculate the sum of squared differences
sum_of_squares = 0
for xi, yi in zip(x, y):
sum_of_squares += (a + b * xi - yi)**2
# Divide by 10 and simplify
f = (sum_of_squares / 10).simplify()
# Calculate partial derivatives
df_da = sympy.diff(f, a)
df_db = sympy.diff(f, b)
# Solve for df/da = 0 and df/db = 0
solution = sympy.solve((df_da, df_db), (a, b))
solution
{a: 11/5, b: 3/5}
# solve the about problem with sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
# Sample data (replace with your actual data)
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1) # Reshape X to a 2D array
y = np.array([2, 4, 5, 4, 5])
# Create and fit the linear regression model
model = LinearRegression()
model.fit(X, y)
# Make predictions
y_pred = model.predict(X)
# Print the coefficients
print(f"Intercept (theta0): {model.intercept_}")
print(f"Slope (theta1): {model.coef_[0]}")
# Plot the results
plt.figure(figsize=(8, 6))
plt.scatter(X, y, label='Sample Data')
plt.plot(X, y_pred, color='red', label='Regression Line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Least Squares Regression with scikit-learn')
plt.legend()
plt.grid(True)
plt.show()
Intercept (theta0): 2.2 Slope (theta1): 0.6
Multi-dimensional Least Squares Regression:
The goal is to find the best-fitting hyperplane (a plane in more than two dimensions) that minimizes the sum of squared errors between the predicted values and the actual values.
Formula:
The model for multi-dimensional linear regression can be represented as:
ˆy=θ0+θ1X1+θ2X2+...+θnXnWhere:
ˆy is the predicted value.
θ0 is the intercept.
θ1,θ2,...,θn are the coefficients (weights) for each feature.
X1,X2,...,Xn are the input features.
Cost Function:
Similar to the 2D case, the cost function we want to minimize is the Mean Squared Error (MSE):
J(θ)=12mm∑i=1(ˆy(i)−y(i))2Where:
J(θ) is the cost function.
m is the number of data points.
ˆy(i) is the prediction for the i-th data point.
y(i) is the actual value for the i-th data point.
Gradient Descent:
The code uses gradient descent to find the optimal values for the coefficients (θ0,θ1,...,θn) that minimize the cost function. The update rule for each coefficient is:
θj:=θj−α∂J(θ)∂θjWhere:
α is the learning rate. ∂J(θ)∂θjis the partial derivative of the cost function with respect to θj.
In essence, the code aims to:
Generate or load multi-dimensional data (X and y). Implement the multi-dimensional least squares regression model. Use gradient descent to find the optimal coefficients. Evaluate the model and potentially visualize the results.
In code: gradients = (1/m) * X_with_intercept.T.dot(y_pred - y)
This formula calculates the gradients for the model's parameters in multi-dimensional least squares regression using gradient descent.
∇J(θ)=1mXT(Xθ−y)Where:
∇J(θ): Represents the gradient of the cost function J(θ). It's a vector containing the partial derivatives of the cost function with respect to each parameter. This is what the gradients variable stores in the code.
X: Represents the design matrix. It's a matrix where each row represents a data point and each column represents a feature. X_with_intercept in the code represents this design matrix with an added column of 1s for the intercept term.
XT: Represents the transpose of the design matrix. This is obtained by X_with_intercept.T in the code.
θ: Represents the model's parameters (coefficients). These are the values that the gradient descent algorithm aims to optimize. theta in the code is what stores these parameters.
y: Represents the actual target values for the data points. This is the y variable in the code.
m: Represents the number of data points in the dataset. This is obtained by m in the code.
# let's apply to multi dimensional least square regression
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Multi-dimensional Least Squares Regression
def multi_dim_least_squares_gd(X, y, learning_rate=0.01, iterations=1000, threshold=1e-6):
m, n = X.shape # m data points, n features
theta = np.random.randn(n + 1) # Initialize theta (intercept and slopes)
history = [theta.copy()]
errors = []
for i in range(iterations):
# Predict y values
# adds a column of 1s to the X_batch to represent the intercept term in the linear regression model
X_with_intercept = np.column_stack((np.ones(m), X))
#ŷ = θ₀ + θ₁X₁ + θ₂X₂ + ... + θₙXₙ
y_pred = X_with_intercept.dot(theta)
# Calculate the error, Mean Squared Error (MSE)
error = np.mean((y_pred - y)**2) / 2
errors.append(error)
# Calculate gradients
gradients = (1/m) * X_with_intercept.T.dot(y_pred - y)
# Update theta using gradient descent
theta -= learning_rate * gradients
history.append(theta.copy())
# Check for convergence
if i > 0 and abs(errors[-1] - errors[-2]) < threshold:
print(f"Converged at iteration {i}")
break
return theta, np.array(history), errors
# Sample 3D Data (replace with your data)
X = np.random.rand(100, 2) # 100 data points, 2 features
y = 2*X[:, 0] + 3*X[:, 1] + 1 + 0.1*np.random.randn(100) # Linear relationship with noise
# Perform gradient descent
#decrease learning rate, threshold for better solution
theta, theta_history, errors = multi_dim_least_squares_gd(X, y, learning_rate=0.01, iterations=10000)
# Plot the regression plane (3D)
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], y, label='Data')
xx, yy = np.meshgrid(np.linspace(0, 1, 10), np.linspace(0, 1, 10))
zz = theta[0] + theta[1] * xx + theta[2] * yy
ax.plot_surface(xx, yy, zz, alpha=0.5) # Plot the fitted plane
ax.set_xlabel('X1')
ax.set_ylabel('X2')
ax.set_zlabel('y')
ax.set_title('3D Regression Plane')
# Plot the optimization path in 3D
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(111, projection='3d')
ax.plot(theta_history[:,0], theta_history[:,1], theta_history[:,2], marker='o', markersize=3, linestyle='-')
ax.scatter(theta_history[0,0],theta_history[0,1], theta_history[0,2], color='red', label='Start')
ax.scatter(theta_history[-1,0], theta_history[-1,1], theta_history[-1,2], color='green', label='End')
ax.set_xlabel("theta0")
ax.set_ylabel("theta1")
ax.set_zlabel("theta2")
ax.legend()
# Plot error values over iterations
plt.figure()
plt.plot(errors)
plt.axhline(y=1e-6, color='r', linestyle='--', label='Threshold') #Threshold line
plt.xlabel("Iterations")
plt.ylabel("Error")
plt.title("Error convergence")
plt.legend()
plt.show()
print(f"Optimized theta: {theta}")
Converged at iteration 3669
Optimized theta: [1.08720352 1.93276973 2.91362936]
SGD: Embracing Randomness for Efficiency
Stochastic Gradient Descent (SGD) is a variation of Gradient Descent that introduces randomness into the optimization process. Instead of using the entire dataset to calculate the gradient in each iteration, SGD uses only a single data point (or a small batch of data points) at a time.
The Motivation
While regular Gradient Descent (GD) can be effective, it can become computationally expensive for large datasets, as it requires processing the entire dataset in each iteration. SGD addresses this issue by using a subset of the data, making it significantly faster, especially for large-scale machine learning tasks.
The Math
The update rule for SGD is similar to that of GD but uses the gradient calculated from a single data point or a mini-batch:
θt+1=θt−α∇J(θt,x(i),y(i))Where:
θt: The current values of the parameters at iteration t.
θt+1: The updated values of the parameters for the next iteration.
α: The learning rate, which controls the step size taken in the direction of the negative gradient.
∇J(θt, x(i), y(i)): The gradient of the cost function evaluated at the current parameter values using only the data point (x(i), y(i)) or a mini-batch of data points.
Why it Works
Noise and Exploration: The randomness introduced by using a single data point or a mini-batch creates noise in the gradient updates. This noise can be beneficial as it allows SGD to escape local minima and explore a wider range of the parameter space, potentially finding a better solution than GD.
Faster Updates: By processing only a subset of the data, SGD can perform updates much faster than GD, especially for large datasets. This allows it to make progress towards the minimum more quickly, even though the updates may be noisier.
Approximation of the True Gradient: While the gradient calculated from a single data point or a mini-batch is an approximation of the true gradient (calculated using the entire dataset), it is often sufficient for making progress towards the minimum. As the algorithm iterates through the data multiple times, the effects of the noise tend to average out, leading to convergence.
Convergence Considerations
Due to the noisy updates, SGD typically exhibits more oscillations during training compared to GD. However, with a properly chosen learning rate and enough iterations, SGD can converge to a good solution.
In essence, SGD works by leveraging the efficiency of using subsets of data for gradient updates while embracing the randomness introduced by this approach to explore the parameter space and escape local minima.
Why SGD?
While regular gradient descent updates the model parameters using the entire dataset in each iteration, SGD updates the parameters using only a single data point (or a small batch of data points) at a time. This makes SGD much faster, especially for large datasets, but it also introduces more randomness into the optimization process.
Stochastic Gradient Descent (SGD)
The core idea of SGD is to update the parameters (θ) using the gradient calculated from a single data point (or a small batch) instead of the entire dataset.
Update Rule:
The update rule for SGD can be represented as:
θj:=θj−α∂J(θ,x(i),y(i))∂θjwhere:
θj represents the j-th parameter.
α is the learning rate.
x(i) and y(i) represent a single data point or a small batch of data points.
∂J(θ,x(i),y(i))∂θj is the partial derivative of the cost function with respect to θj, calculated using only the selected data point(s).
Cost Function (MSE):
The cost function (Mean Squared Error) remains the same as in regular gradient descent:
J(θ)=12mm∑i=1(ˆy(i)−y(i))2However, in SGD, we calculate the partial derivatives and update the parameters using only a single data point (or a small batch) at a time.
Steps in SGD:
Initialization: Initialize the parameters (θ) randomly. Iteration: For a fixed number of iterations (or until convergence): Shuffle: Randomly shuffle the dataset. Mini-Batch Selection: Select a single data point or a small batch of data points. Gradient Calculation: Calculate the gradient of the cost function with respect to the parameters using only the selected data point(s). Parameter Update: Update the parameters using the calculated gradient and the learning rate. Convergence: Check for convergence. If the cost function has not decreased significantly over a certain number of iterations, stop the process. Advantages of SGD:
Faster convergence, especially for large datasets. Can escape local minima due to its inherent randomness. Disadvantages of SGD:
Noisier updates can lead to oscillations during training. Requires careful tuning of the learning rate.
In the code:
gradients = (1/len(X_batch)) * X_batch_with_intercept.T.dot(y_pred - y_batch)
implements the gradient calculation:
∇J(θ)=1mXT(Xθ−y)Where:
∇J(θ): Represents the gradient of the cost function J(θ).
X: Represents the design matrix (input features with an added column of 1s for the intercept).
XT (X_batch_with_intercept.T): Represents the transpose of the design matrix.
θ (y_pred): Represents the model's parameters (coefficients).
y (y_batch): Represents the actual target values.
m: Represents the number of data points in the mini-batch. compare SGD with GD. This m is much smaller, saved tons of computation for large sample set
# let's solve the same problem, but use stochastic gradient descent instead
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Multi-dimensional Least Squares Regression with Stochastic Gradient Descent
def multi_dim_least_squares_sgd(X, y, learning_rate=0.01, iterations=1000, threshold=1e-6, batch_size=1):
m, n = X.shape # m data points, n features
theta = np.random.randn(n + 1) # Initialize theta (intercept and slopes)
history = [theta.copy()]
errors = []
for i in range(iterations):
# Shuffle data for stochasticity
idx = np.random.permutation(m)
X_shuffled = X[idx]
y_shuffled = y[idx]
for j in range(0, m, batch_size):
# Select a mini-batch
X_batch = X_shuffled[j:min(j + batch_size, m)]
y_batch = y_shuffled[j:min(j + batch_size, m)]
# Predict y values
X_batch_with_intercept = np.column_stack((np.ones(len(X_batch)), X_batch))
y_pred = X_batch_with_intercept.dot(theta)
# Calculate the error
error = np.mean((y_pred - y_batch)**2) / 2
errors.append(error)
# Calculate gradients (for the mini-batch)
gradients = (1/len(X_batch)) * X_batch_with_intercept.T.dot(y_pred - y_batch)
# Update theta using stochastic gradient descent
theta -= learning_rate * gradients
history.append(theta.copy())
# Check for convergence (based on average error in the epoch)
if i > 0 and abs(np.mean(errors[-m:]) - np.mean(errors[-(2*m):-m])) < threshold :
print(f"Converged at iteration {i}")
break
return theta, np.array(history), errors
# ... (rest of your existing code for data generation, plotting, etc.)
# Perform gradient descent
theta, theta_history, errors = multi_dim_least_squares_sgd(X, y, learning_rate=0.01, iterations=10000)
# Plot the regression plane (3D)
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], y, label='Data')
xx, yy = np.meshgrid(np.linspace(0, 1, 10), np.linspace(0, 1, 10))
zz = theta[0] + theta[1] * xx + theta[2] * yy
ax.plot_surface(xx, yy, zz, alpha=0.5) # Plot the fitted plane
ax.set_xlabel('X1')
ax.set_ylabel('X2')
ax.set_zlabel('y')
ax.set_title('3D Regression Plane')
# Plot the optimization path in 3D
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(111, projection='3d')
ax.plot(theta_history[:,0], theta_history[:,1], theta_history[:,2], marker='o', markersize=3, linestyle='-')
ax.scatter(theta_history[0,0],theta_history[0,1], theta_history[0,2], color='red', label='Start')
ax.scatter(theta_history[-1,0], theta_history[-1,1], theta_history[-1,2], color='green', label='End')
ax.set_xlabel("theta0")
ax.set_ylabel("theta1")
ax.set_zlabel("theta2")
ax.legend()
# Plot error values over iterations
plt.figure()
plt.plot(errors)
plt.axhline(y=1e-6, color='r', linestyle='--', label='Threshold') #Threshold line
plt.xlabel("Iterations")
plt.ylabel("Error")
plt.title("Error convergence")
plt.legend()
plt.show()
print(f"Optimized theta: {theta}")
Converged at iteration 92
Optimized theta: [1.01902487 1.95956908 3.01781999]
# call a package method to solve this regression problem using scikit-learn.
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Sample 3D Data (replace with your data)
#X = np.random.rand(100, 2) # 100 data points, 2 features
#y = 2 * X[:, 0] + 3 * X[:, 1] + 1 + 0.1 * np.random.randn(100) # Linear relationship with noise
# Create and train the linear regression model
model = LinearRegression()
model.fit(X, y)
# Make predictions
y_pred = model.predict(X)
# Print the coefficients
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
# Plot the regression plane (3D)
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], y, label='Data')
xx, yy = np.meshgrid(np.linspace(0, 1, 10), np.linspace(0, 1, 10))
zz = model.intercept_ + model.coef_[0] * xx + model.coef_[1] * yy
ax.plot_surface(xx, yy, zz, alpha=0.5) # Plot the fitted plane
ax.set_xlabel('X1')
ax.set_ylabel('X2')
ax.set_zlabel('y')
ax.set_title('3D Regression Plane using scikit-learn')
plt.show()
Intercept: 1.0145740327057973 Coefficients: [1.96275131 3.01876375]
No comments:
Post a Comment