Performing model selection on predictors using a t-test involves determining whether each predictor variable in a regression model has a statistically significant effect on the dependent variable. If a predictor does not significantly contribute to explaining the variation in the target variable, it may be removed from the model to improve efficiency and interpretability.
Steps to Perform Model Selection Using a t-Test
Fit a Regression Model
- Choose a regression model, such as linear regression.
- Fit the model using all available predictor variables.
Extract t-Statistics and p-Values
- The t-test evaluates the null hypothesis that a predictor’s coefficient is zero (meaning the predictor has no effect).
- Compute the t-statistic:

$$t = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}$$

where:
- $$\hat{\beta}_j$$ is the estimated coefficient for predictor j,
- $$SE(\hat{\beta}_j)$$ is the standard error of the coefficient.
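As a quick illustration of the formula, the snippet below computes the t-statistic and its two-sided p-value by hand for one coefficient, using scipy's t-distribution. The numbers are illustrative placeholders (they match the X1 row of the regression output later in this post, rounded):

```python
from scipy import stats

# Illustrative values for one coefficient (not computed here)
beta_hat = 1.8227   # estimated coefficient for predictor j
se_beta = 0.1005    # standard error of that coefficient
df_resid = 96       # residual degrees of freedom (n - number of parameters)

# t-statistic: estimated coefficient divided by its standard error
t_stat = beta_hat / se_beta

# Two-sided p-value from the t-distribution with df_resid degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)

print(f"t = {t_stat:.3f}, p = {p_value:.4g}")
```

A p-value this small would lead us to keep the predictor at any conventional significance level.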
Assess Statistical Significance
- The p-value from the t-test is the probability of observing a coefficient estimate at least as extreme as the one obtained, assuming the predictor truly has no effect.
- Typically, predictors with p-values greater than 0.05 are considered statistically insignificant at the 5% level.
Eliminate Insignificant Predictors
- If a predictor’s p-value is above a chosen threshold (e.g., 0.05), remove it from the model.
- Refit the model with the remaining predictors.
Repeat Until All Predictors are Significant
- Continue the process iteratively until only significant predictors remain.
In the following example, we remove the predictor X3 based on its p-value of 0.748. We use statsmodels.api.OLS(y, X).fit() rather than sklearn.linear_model.LinearRegression().fit(X, y) because sklearn does not directly report p-values or confidence intervals the way statsmodels does.
import pandas as pd
import numpy as np
import statsmodels.api as sm
# Generate synthetic data
np.random.seed(42)
n = 100
X1 = np.random.randn(n)
X2 = np.random.randn(n)
X3 = np.random.randn(n)
y = 3 + 2 * X1 + 0.5 * X2 + np.random.randn(n) # X3 is irrelevant
# Create DataFrame
df = pd.DataFrame({'X1': X1, 'X2': X2, 'X3': X3, 'y': y})
# Add constant term for intercept
X = sm.add_constant(df[['X1', 'X2', 'X3']])
y = df['y']
# Fit the regression model
model = sm.OLS(y, X).fit()
# Display summary (includes t-test results)
print(model.summary())
# Select predictors based on p-values
significant_vars = model.pvalues[model.pvalues < 0.05].index.tolist()
# Remove non-significant variables and refit model
if 'const' in significant_vars:
    significant_vars.remove('const')  # drop 'const' here; add_constant re-adds the intercept below
X_new = sm.add_constant(df[significant_vars])
model_new = sm.OLS(y, X_new).fit()
# Display updated model summary, notice X3 is removed
print(model_new.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.784
Model: OLS Adj. R-squared: 0.778
Method: Least Squares F-statistic: 116.4
Date: Mon, 17 Mar 2025 Prob (F-statistic): 7.26e-32
Time: 03:57:07 Log-Likelihood: -127.46
No. Observations: 100 AIC: 262.9
Df Residuals: 96 BIC: 273.3
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 3.0875 0.089 34.599 0.000 2.910 3.265
X1 1.8227 0.100 18.140 0.000 1.623 2.022
X2 0.4618 0.094 4.913 0.000 0.275 0.648
X3 0.0269 0.083 0.322 0.748 -0.139 0.193
==============================================================================
Omnibus: 1.353 Durbin-Watson: 1.821
Prob(Omnibus): 0.508 Jarque-Bera (JB): 1.317
Skew: 0.169 Prob(JB): 0.518
Kurtosis: 2.551 Cond. No. 1.37
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.784
Model: OLS Adj. R-squared: 0.780
Method: Least Squares F-statistic: 176.2
Date: Mon, 17 Mar 2025 Prob (F-statistic): 5.08e-33
Time: 03:57:07 Log-Likelihood: -127.52
No. Observations: 100 AIC: 261.0
Df Residuals: 97 BIC: 268.8
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 3.0899 0.089 34.907 0.000 2.914 3.266
X1 1.8288 0.098 18.616 0.000 1.634 2.024
X2 0.4614 0.094 4.933 0.000 0.276 0.647
==============================================================================
Omnibus: 1.512 Durbin-Watson: 1.820
Prob(Omnibus): 0.469 Jarque-Bera (JB): 1.434
Skew: 0.180 Prob(JB): 0.488
Kurtosis: 2.537 Cond. No. 1.23
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.