Performing model selection on predictors using a t-test involves determining whether each predictor variable in a regression model has a statistically significant effect on the dependent variable. If a predictor does not significantly contribute to explaining the variation in the target variable, it may be removed from the model to improve efficiency and interpretability.
Steps to Perform Model Selection Using a t-Test
Fit a Regression Model
- Choose a regression model, such as linear regression.
- Fit the model using all available predictor variables.
Extract t-Statistics and p-Values
- The t-test evaluates the null hypothesis that a predictor’s coefficient is zero (meaning the predictor has no effect).
- Compute the t-statistic:

$$t = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}$$

where:
- $$\hat{\beta}_j$$ is the estimated coefficient for predictor j,
- $$SE(\hat{\beta}_j)$$ is the standard error of the coefficient.
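As a quick illustration of the formula, the snippet below computes the t-statistic and its two-sided p-value by hand for one coefficient, using scipy's t-distribution. The numbers are illustrative placeholders (they match the X1 row of the regression output later in this post, rounded):

```python
from scipy import stats

# Illustrative values for one coefficient (not computed here)
beta_hat = 1.8227   # estimated coefficient for predictor j
se_beta = 0.1005    # standard error of that coefficient
df_resid = 96       # residual degrees of freedom (n - number of parameters)

# t-statistic: estimated coefficient divided by its standard error
t_stat = beta_hat / se_beta

# Two-sided p-value from the t-distribution with df_resid degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)

print(f"t = {t_stat:.3f}, p = {p_value:.4g}")
```

A p-value this small would lead us to keep the predictor at any conventional significance level.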
Assess Statistical Significance
- The p-value from the t-test is the probability of observing a coefficient estimate at least as extreme as the one obtained, assuming the predictor truly has no effect.
- Typically, predictors with p-values greater than 0.05 are considered statistically insignificant at the 5% level.
Eliminate Insignificant Predictors
- If a predictor’s p-value is above a chosen threshold (e.g., 0.05), remove it from the model.
- Refit the model with the remaining predictors.
Repeat Until All Predictors are Significant
- Continue the process iteratively until only significant predictors remain.
In the following example, we remove the predictor X3 based on its p-value of 0.748. We use statsmodels.api.OLS(y, X).fit() rather than sklearn.linear_model.LinearRegression().fit(X, y) because sklearn does not directly report p-values or confidence intervals the way statsmodels does.
import pandas as pd
import numpy as np
import statsmodels.api as sm
# Generate synthetic data
np.random.seed(42)
n = 100
X1 = np.random.randn(n)
X2 = np.random.randn(n)
X3 = np.random.randn(n)
y = 3 + 2 * X1 + 0.5 * X2 + np.random.randn(n) # X3 is irrelevant
# Create DataFrame
df = pd.DataFrame({'X1': X1, 'X2': X2, 'X3': X3, 'y': y})
# Add constant term for intercept
X = sm.add_constant(df[['X1', 'X2', 'X3']])
y = df['y']
# Fit the regression model
model = sm.OLS(y, X).fit()
# Display summary (includes t-test results)
print(model.summary())
# Select predictors based on p-values
significant_vars = model.pvalues[model.pvalues < 0.05].index.tolist()
# Remove non-significant variables and refit model
if 'const' in significant_vars:
    significant_vars.remove('const')  # drop 'const' here; add_constant re-adds the intercept below
X_new = sm.add_constant(df[significant_vars])
model_new = sm.OLS(y, X_new).fit()
# Display updated model summary, notice X3 is removed
print(model_new.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.784
Model: OLS Adj. R-squared: 0.778
Method: Least Squares F-statistic: 116.4
Date: Mon, 17 Mar 2025 Prob (F-statistic): 7.26e-32
Time: 03:57:07 Log-Likelihood: -127.46
No. Observations: 100 AIC: 262.9
Df Residuals: 96 BIC: 273.3
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 3.0875 0.089 34.599 0.000 2.910 3.265
X1 1.8227 0.100 18.140 0.000 1.623 2.022
X2 0.4618 0.094 4.913 0.000 0.275 0.648
X3 0.0269 0.083 0.322 0.748 -0.139 0.193
==============================================================================
Omnibus: 1.353 Durbin-Watson: 1.821
Prob(Omnibus): 0.508 Jarque-Bera (JB): 1.317
Skew: 0.169 Prob(JB): 0.518
Kurtosis: 2.551 Cond. No. 1.37
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.784
Model: OLS Adj. R-squared: 0.780
Method: Least Squares F-statistic: 176.2
Date: Mon, 17 Mar 2025 Prob (F-statistic): 5.08e-33
Time: 03:57:07 Log-Likelihood: -127.52
No. Observations: 100 AIC: 261.0
Df Residuals: 97 BIC: 268.8
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 3.0899 0.089 34.907 0.000 2.914 3.266
X1 1.8288 0.098 18.616 0.000 1.634 2.024
X2 0.4614 0.094 4.933 0.000 0.276 0.647
==============================================================================
Omnibus: 1.512 Durbin-Watson: 1.820
Prob(Omnibus): 0.469 Jarque-Bera (JB): 1.434
Skew: 0.180 Prob(JB): 0.488
Kurtosis: 2.537 Cond. No. 1.23
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.