Performing model selection on predictors using a t-test involves determining whether each predictor variable in a regression model has a statistically significant effect on the dependent variable. If a predictor does not significantly contribute to explaining the variation in the target variable, it may be removed from the model to improve efficiency and interpretability.
Steps to Perform Model Selection Using a t-Test
Fit a Regression Model
- Choose a regression model, such as linear regression.
- Fit the model using all available predictor variables.
Extract t-Statistics and p-Values
- The t-test evaluates the null hypothesis that a predictor’s coefficient is zero (meaning the predictor has no effect).
- Compute the t-statistic:

  t_j = β̂j / SE(β̂j)

  where:
  - β̂j is the estimated coefficient for predictor j,
  - SE(β̂j) is the standard error of the coefficient.
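As a quick numeric illustration of this ratio (the coefficient, standard error, and degrees of freedom below are made-up values, not taken from any particular fit), the t-statistic and its two-sided p-value can be computed directly:

```python
from scipy import stats

# Hypothetical values for one coefficient (illustration only)
beta_hat = 1.8227   # estimated coefficient, beta_hat_j
se = 0.100          # standard error, SE(beta_hat_j)
df_resid = 96       # residual degrees of freedom (n minus number of parameters)

t_stat = beta_hat / se                             # t = 18.227
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)    # two-sided p-value under H0: beta_j = 0
```

A t-statistic this large corresponds to a p-value far below 0.05, so such a predictor would be kept.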
Assess Statistical Significance
- The p-value from the t-test is the probability of observing a coefficient estimate at least as extreme as the one obtained, assuming the predictor has no actual effect.
- Typically, predictors with p-values greater than 0.05 (5%) are considered statistically insignificant.
Eliminate Insignificant Predictors
- If a predictor’s p-value is above a chosen threshold (e.g., 0.05), remove it from the model.
- Refit the model with the remaining predictors.
Repeat Until All Predictors are Significant
- Continue the process iteratively until only significant predictors remain.
In the following example, we remove the predictor X3 based on its p-value of 0.748. We use statsmodels.api.OLS(y, X).fit() instead of sklearn.linear_model.LinearRegression().fit(X, y) because sklearn does not directly provide p-values or confidence intervals the way statsmodels does.
import pandas as pd
import numpy as np
import statsmodels.api as sm
# Generate synthetic data
np.random.seed(42)
n = 100
X1 = np.random.randn(n)
X2 = np.random.randn(n)
X3 = np.random.randn(n)
y = 3 + 2 * X1 + 0.5 * X2 + np.random.randn(n) # X3 is irrelevant
# Create DataFrame
df = pd.DataFrame({'X1': X1, 'X2': X2, 'X3': X3, 'y': y})
# Add constant term for intercept
X = sm.add_constant(df[['X1', 'X2', 'X3']])
y = df['y']
# Fit the regression model
model = sm.OLS(y, X).fit()
# Display summary (includes t-test results)
print(model.summary())
# Select predictors based on p-values
significant_vars = model.pvalues[model.pvalues < 0.05].index.tolist()
# Remove non-significant variables and refit model
if 'const' in significant_vars:
    significant_vars.remove('const')  # drop 'const' here; add_constant re-adds the intercept below
X_new = sm.add_constant(df[significant_vars])
model_new = sm.OLS(y, X_new).fit()
# Display updated model summary, notice X3 is removed
print(model_new.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.784
Model:                            OLS   Adj. R-squared:                  0.778
Method:                 Least Squares   F-statistic:                     116.4
Date:                Mon, 17 Mar 2025   Prob (F-statistic):           7.26e-32
Time:                        03:57:07   Log-Likelihood:                -127.46
No. Observations:                 100   AIC:                             262.9
Df Residuals:                      96   BIC:                             273.3
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.0875      0.089     34.599      0.000       2.910       3.265
X1             1.8227      0.100     18.140      0.000       1.623       2.022
X2             0.4618      0.094      4.913      0.000       0.275       0.648
X3             0.0269      0.083      0.322      0.748      -0.139       0.193
==============================================================================
Omnibus:                        1.353   Durbin-Watson:                   1.821
Prob(Omnibus):                  0.508   Jarque-Bera (JB):                1.317
Skew:                           0.169   Prob(JB):                        0.518
Kurtosis:                       2.551   Cond. No.                         1.37
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.784
Model:                            OLS   Adj. R-squared:                  0.780
Method:                 Least Squares   F-statistic:                     176.2
Date:                Mon, 17 Mar 2025   Prob (F-statistic):           5.08e-33
Time:                        03:57:07   Log-Likelihood:                -127.52
No. Observations:                 100   AIC:                             261.0
Df Residuals:                      97   BIC:                             268.8
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.0899      0.089     34.907      0.000       2.914       3.266
X1             1.8288      0.098     18.616      0.000       1.634       2.024
X2             0.4614      0.094      4.933      0.000       0.276       0.647
==============================================================================
Omnibus:                        1.512   Durbin-Watson:                   1.820
Prob(Omnibus):                  0.469   Jarque-Bera (JB):                1.434
Skew:                           0.180   Prob(JB):                        0.488
Kurtosis:                       2.537   Cond. No.                         1.23
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.