Fitting regression models

Transformations of variables

In the first part of this exercise, you have carried out multiple linear regression analysis. Today you will address three final considerations for a good regression model:

  1. Variable transformations that allow fitting linear models to non-linear relationships,
  2. Fitting non-linear models,
  3. Deciding which model is better.

Bring up the results of your analysis on leaf physiognomy and climate variables (e.g. re-run your code from Tuesday) and take another look at the results. They include the following note:

[2] The condition number is large, 6.09e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

The fine print at the bottom of the model summary mentions multicollinearity. It seems that multiple predictor variables are correlated with each other. This adds insult to injury: from the pairplot output, we already know that some variables have non-linear relationships with the Mean Annual Temperature and need a transformation. So all your conclusions could be wrong 😱
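To see where that condition number comes from, here is a minimal sketch with a small hypothetical design matrix `X` (not data from the exercise): the reported number is the ratio of the largest to the smallest singular value of the design matrix, and it blows up when columns are nearly linearly dependent.

```python
import numpy as np

# Hypothetical design matrix: an intercept, a predictor x, and a column that
# is almost exactly 2*x -- a near-linear dependency between columns
X = np.column_stack([
    np.ones(5),
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.1, 5.9, 8.2, 9.9],
])

# Condition number = largest singular value / smallest singular value
cond = np.linalg.cond(X)
print(cond)  # large: the columns are close to collinear
```

A well-conditioned matrix has a condition number near 1; values in the thousands, as in the model summary note, are a warning sign.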

Now build The Best Model: transform variables, and if some remain highly correlated with each other, check which one is the best predictor and keep only that one. As a result, you will have a model with fewer predictors and a better fit. To assess the fit, you can revisit the questions you answered last time, but for your new model:

Question 1
  1. What is the decision on the overall F-statistic for the model? Recall from the lecture what \(H_0\) is and what the alternative hypothesis is. Can we reject \(H_0\) here?
Question 2
  1. Which leaf trait is the strongest predictor of MAT?
Question 3
  1. Which leaf trait(s) can we kick out of the model? Based on what criterion?
Question 4
  1. Plot the residuals. Are they normally distributed?

Tip: you might want to investigate how the predictor variables are correlated with each other. To visualize this, you can use the matshow function from the matplotlib library. It is not required for solving this question, but it might help you understand the relationships between the variables.

Solution

In the last tutorial we found TA (cm²) and Leaf area (cm²) to have p-values greater than 0.05, suggesting they can be removed from the model. Now we turn to the remaining predictors. For this we calculate the Variance Inflation Factor (VIF) and check the correlation matrix.

import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Select predictors
X = df[['Percent untoothed*', '#teeth', 'Peri/Area', 'TA/BA']]

# Calculate VIF
vif_data = pd.DataFrame()
vif_data['Feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)

# Check the correlation matrix
correlation_matrix = df[['Percent untoothed*', '#teeth', 'Peri/Area', 'TA/BA']].corr()
print(correlation_matrix)

# Example transformation: Log transformation for TA/BA
df['TA/BA'] = np.log1p(df['TA/BA'])  # Log transformation (log(x + 1))

# Recalculate VIF after any transformations
X_transformed = df[['Percent untoothed*', '#teeth', 'Peri/Area', 'TA/BA']]
vif_data_transformed = pd.DataFrame()
vif_data_transformed['Feature'] = X_transformed.columns
vif_data_transformed['VIF'] = [variance_inflation_factor(X_transformed.values, i) for i in range(X_transformed.shape[1])]

print(vif_data_transformed)

# Check the correlation matrix
correlation_matrix = df[['Percent untoothed*', '#teeth', 'Peri/Area', 'TA/BA']].corr()
print(correlation_matrix)

#Bonus code: visualizing the correlation matrix
import matplotlib.pyplot as plt
p = plt.figure()
plt.matshow(correlation_matrix)
plt.xticks(range(correlation_matrix.shape[1]), correlation_matrix.columns, rotation=45)
plt.yticks(range(correlation_matrix.shape[1]), correlation_matrix.columns)
cb = plt.colorbar()

We see the TA/BA feature has a VIF of 5.73, which is relatively high compared to the others. This indicates that it might be somewhat correlated with other predictors. TA/BA also shows high correlation with other predictors, so we transform it.

The VIF of TA/BA remains high. Drop it and make the final model:

import statsmodels.api as sm

# Drop 'TA/BA' from the predictor variables
X = df[['Percent untoothed*', '#teeth', 'Peri/Area']]
Y = df['MAT (°C)']

# Standardize the variables
X_std = (X - X.mean()) / X.std()
y_std = (df['MAT (°C)'] - df['MAT (°C)'].mean()) / df['MAT (°C)'].std()

# Fit the OLS model with the reduced set of independent variables
model_final = sm.OLS(y_std, X_std).fit()

# Output the summary of the new model
print(model_final.summary())
  • Null hypothesis (\(H_0\)): \(\beta_1 = \beta_2 = \dots = \beta_m = 0\)
  • Alternative hypothesis (\(H_1\)): \(\beta_j \neq 0\) for at least one value of j

Because the p-value for the overall F-statistic is extremely small (\(1.66 \times 10^{-23}\)), we can reject the null hypothesis: at least one of the coefficients (\(\beta_j\)) differs from zero.

To identify the strongest predictor of Mean Annual Temperature (MAT), we need to examine the coefficients and p-values of the leaf traits in the regression model. The strongest predictor is typically the one that:

  • Is statistically significant (small p-value, typically < 0.05).
  • Has the largest absolute coefficient, indicating a strong effect on the dependent variable (MAT).

In this case, Percent untoothed is the strongest predictor of MAT because it is highly statistically significant (p < 0.001) and has the largest absolute coefficient (0.8152), meaning it has the strongest influence on MAT.

Based on the p-value criterion, we can remove predictors with p-values above 0.05, as they are not statistically significant. In this case, #teeth has a p-value greater than 0.05, suggesting it can be removed from the model.

# Drop '#teeth' from the predictor variables
X = df[['Percent untoothed*','Peri/Area']]
Y = df['MAT (°C)']

# Standardize the variables
X_std = (X - X.mean()) / X.std()
y_std = (df['MAT (°C)'] - df['MAT (°C)'].mean()) / df['MAT (°C)'].std()

# Fit the OLS model with the reduced set of independent variables
model_final = sm.OLS(y_std, X_std).fit()

# Output the summary of the new model
print(model_final.summary())

import seaborn as sns

# Calculate the residuals
residuals = model_final.resid

# Set up the matplotlib figure
plt.figure(figsize=(14, 6))

# Histogram of residuals
plt.subplot(1, 2, 1)
sns.histplot(residuals, bins=20, kde=True)
plt.title('Histogram of Residuals')
plt.xlabel('Residuals')
plt.ylabel('Frequency')

# Q-Q plot
plt.subplot(1, 2, 2)
sm.qqplot(residuals, line='s', ax=plt.gca())
plt.title('Q-Q Plot of Residuals')

plt.tight_layout()
plt.show()

Which model is better?

In the textbook (12.3.2) and the lecture, you encountered the Akaike Information Criterion (AIC). It is a measure of the relative quality of a statistical model for a given dataset. Relative is the key word here: you cannot compare models that have different response variables. But you can compare, for example, models with different numbers of predictor variables, or a linear model with a non-linear one, to see which is better.

Question 5

What was the AIC value for your original model? And what is the AIC for your model after transforming and eliminating variables? Which model has a lower AIC and is thus a better model?

Solution
original_model_aic = model_refit.aic  # Replace 'model_refit' with your actual model variable
new_model_aic = model_final.aic

# Print both AIC values
print(f"Original Model AIC: {original_model_aic}")
print(f"New Model AIC: {new_model_aic}")

# Compare the models based on AIC
if new_model_aic < original_model_aic:
    print("The new model is better (lower AIC).")
else:
    print("The original model is better (lower AIC).")  

Choosing between a linear and non-linear model

Linear regression models are the easiest to fit, so transforming variables is often the best strategy. But sometimes the variables are defined in such a way that they must follow a non-linear relationship for physical or geometric reasons. For example, leaf perimeter and area scale differently.

Let us assume that a leaf can be approximated with a circle. Then the perimeter (and thus the number of toothlets) scales as \(2 \pi r\) and the area as \(\pi r^2\). So if the predicted variable, such as MAT, is linearly dependent on one of them, it cannot be also linearly dependent on the other one. Things get much worse if you approximate a leaf with an ellipse, which is more accurate, but the calculations become much more complicated and we’ll skip them here (you can try at home if you’re curious).
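For the circle approximation, eliminating \(r\) gives the perimeter in terms of the area as \(P = 2\sqrt{\pi A}\): a quantity linear in perimeter is proportional to the square root of area, not to area itself. A quick numerical check:

```python
import numpy as np

r = np.array([1.0, 2.0, 3.0, 4.0])
area = np.pi * r**2          # A = pi * r^2
perimeter = 2 * np.pi * r    # P = 2 * pi * r

# Eliminate r: P = 2 * sqrt(pi * A), so perimeter scales with sqrt(area)
print(np.allclose(perimeter, 2 * np.sqrt(np.pi * area)))  # True
```

This is the geometric motivation for trying a square-root transformation of leaf area in the question below.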

Question 6

Transform the leaf area in the calibration dataset using the square root transformation and a natural logarithm. Which transformation gives a better linear relationship with MAT visually? And which one is best using the fit criteria you’ve learned so far?

import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots()
ax.plot(df['Leaf area (cm2)'], df['MAT (°C)'], '*')
ax.set_ylabel('MAT (°C)')
ax.set_xlabel('Leaf area (cm2)')

fig, ax = plt.subplots()
ax.plot(np.log(df['Leaf area (cm2)']), df['MAT (°C)'], 'o')
ax.set_ylabel('MAT (°C)')
ax.set_xlabel('Log of leaf area (cm2)')

fig, ax = plt.subplots()
ax.plot(np.sqrt(df['Leaf area (cm2)']), df['MAT (°C)'], 'o')
ax.set_ylabel('MAT (°C)')
ax.set_xlabel('Square root of leaf area (cm2)')
# Transform the predictor first, then add the constant: applying np.log to a
# matrix that already contains the constant column would turn the intercept
# column of ones into zeros and silently drop the intercept from the model
X_null = sm.add_constant(df['Leaf area (cm2)'])
X_log = sm.add_constant(np.log(df['Leaf area (cm2)']))
X_sqrt = sm.add_constant(np.sqrt(df['Leaf area (cm2)']))
Null_model = sm.OLS(df['MAT (°C)'], X_null).fit()
Log_model = sm.OLS(df['MAT (°C)'], X_log).fit()
Sqrt_model = sm.OLS(df['MAT (°C)'], X_sqrt).fit()
print(Null_model.aic)
print(Log_model.aic)
print(Sqrt_model.aic)

If you dig deeper, you will discover that AIC - like many other estimated quantities - is biased when calculated for small sample sizes. You can obtain a corrected value, AICc, using statsmodels; see the manual for details. The values of the parameters can be read from the summary table of each model.
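The small-sample correction adds a penalty that vanishes as \(n\) grows: \(\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}\), where \(k\) is the number of parameters including the constant. A quick check of this identity against statsmodels' implementation, using arbitrary illustrative numbers rather than values from the exercise:

```python
from statsmodels.tools.eval_measures import aic, aicc

# Arbitrary log-likelihood, sample size, and parameter count (incl. constant)
llf, nobs, k = -50.0, 30, 2

# AICc = AIC + 2k(k+1)/(n - k - 1)
manual_aicc = aic(llf, nobs, k) + 2 * k * (k + 1) / (nobs - k - 1)
print(manual_aicc, aicc(llf, nobs, k))  # the two values agree
```

With only 30 observations the correction is noticeable; with thousands it is negligible and AIC and AICc rank models identically.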

from statsmodels.tools.eval_measures import aicc
print(f'AICc for the model without transformations: {aicc(llf = Null_model.llf, nobs = Null_model.nobs, df_modelwc = 2)}')
print(f'AICc for the model with a log-transformation: {aicc(llf = Log_model.llf, nobs = Log_model.nobs, df_modelwc = 2)}')
print(f'AICc for the model with a root transformation: {aicc(llf = Sqrt_model.llf, nobs = Sqrt_model.nobs, df_modelwc = 2)}')

How does it compare with \(\bar{R}^2\)?

print(f'R2adj for the model without transformations: {Null_model.rsquared_adj}')
print(f'R2adj for the model with a log-transformation: {Log_model.rsquared_adj}')
print(f'R2adj for the model with a root transformation: {Sqrt_model.rsquared_adj}')
fig, ax = plt.subplots()
ax.plot(df['Leaf area (cm2)'], df['MAT (°C)'], '*')
ax.plot(df['Leaf area (cm2)'], Null_model.fittedvalues, '-', color='#00000080')
ax.set_xlabel('Leaf area (cm^2)')
ax.set_ylabel('MAT (°C)')
fig, ax = plt.subplots()
ax.plot(np.log(df['Leaf area (cm2)']), df['MAT (°C)'], 'o')
ax.plot(np.log(df['Leaf area (cm2)']), Log_model.fittedvalues, '-', color='#00000080')
ax.set_xlabel('Log Leaf area (cm^2)')
ax.set_ylabel('MAT (°C)')
fig, ax = plt.subplots()
ax.plot(np.sqrt(df['Leaf area (cm2)']), df['MAT (°C)'], 'o')
ax.plot(np.sqrt(df['Leaf area (cm2)']), Sqrt_model.fittedvalues, '-', color='#00000080')
ax.set_xlabel('Sqrt Leaf area (cm^2)')
ax.set_ylabel('MAT (°C)')

Why do the AIC and \(\bar{R}^2\) indicate different best models?

Sometimes we are not happy with the regression model obtained using transformations. In other cases, we want to predict values outside of those represented in our training dataset. For example, very hot temperatures that might have occurred in the past but are not represented in the leaf assemblages in our calibration dataset. In such cases, we might be truly interested in a non-linear model.

The details of the computation follow the convention in the textbook, chapter 12.3.2. statsmodels can fit a non-linear model as long as the model is linear in its parameters. If you want to know more, please see the example in the manual.

M_sqrt = np.column_stack((np.ones_like(df['Leaf area (cm2)']), df['Leaf area (cm2)'], np.sqrt(df['Leaf area (cm2)'])) )
Sqrt_nonlinear = sm.OLS(df['MAT (°C)'], M_sqrt).fit()
fig, ax = plt.subplots()
ax.plot(M_sqrt[:,2], df['MAT (°C)'], 'o')
ax.plot(M_sqrt[:,2], Sqrt_nonlinear.fittedvalues, '*', color='#00000080')
ax.set_xlabel('Sqrt Leaf area (cm^2)')
ax.set_ylabel('MAT (°C)')
M_log = np.column_stack((np.ones_like(df['Leaf area (cm2)']), df['Leaf area (cm2)'], np.log(df['Leaf area (cm2)'])) )
Log_nonlinear = sm.OLS(df['MAT (°C)'], M_log).fit()
fig, ax = plt.subplots()
ax.plot(M_log[:,2], df['MAT (°C)'], 'o')
ax.plot(M_log[:,2], Log_nonlinear.fittedvalues, '*', color='#00000080')
ax.set_xlabel('Log Leaf area (cm^2)')
ax.set_ylabel('MAT (°C)')

Model uncertainty

Remember that what you are doing in this exercise is developing a proxy. Your model will be applied to fossil leaves to reconstruct Mean Annual Temperature in the past to fuel palaeoclimate models.

In practical 10, you have learned how to calculate the confidence intervals (CI) and belt for the model. Look back into your code and apply it to your new model. First get the confidence intervals for your predictor variables. Then calculate the prediction interval.

Question 7

Using confidence intervals, which predictor variables do not have a significant effect on MAT?

Solution

Both predictor variables Percent untoothed and Peri/Area have confidence intervals that do not include zero. This suggests that both predictors have a statistically significant effect on Mean Annual Temperature (MAT) based on the model.

The predictive power of a regression model

Congratulations, you have just developed a palaeoclimate proxy. Now you can reconstruct mean annual temperature for fossil sites where we have leaf assemblages but no thermometer record. The same article from which your calibration dataset comes also includes data from five fossil sites, for which leaf parameters have been collected. You can use the model you created (i.e. the best model you selected) to reconstruct the Mean Annual Temperature for each of these sites.

Read the dataset in:

pathS6 = '../Data/Peppe_et_al_2011_S6.csv'
S6 = pd.read_csv(pathS6, sep=',')

Unfortunately, some variables have slightly different names. You can practice your pandas-fu to unify the names.

Question 8

What was the mean annual temperature and its uncertainty for each of the fossil sites? What measures of uncertainty would you need to report when describing these reconstructions?

Solution
# Rename columns to match your regression model
name_mapping = {
    'Percent untoothed': 'Percent untoothed*',  # Renaming for consistency
    '#teeth': '#teeth',  # Keeping the same
    'Tooth area (cm2)': 'Peri/Area',  # Rename to match your model
    # Add any other necessary renaming
}

# Apply renaming
S6.rename(columns=name_mapping, inplace=True)

# Standardize the fossil-site predictors using the calibration dataset's
# mean and standard deviation, so they are on the scale the model was fitted on
X_train = df[['Percent untoothed*', 'Peri/Area']]
X_new = S6[['Percent untoothed*', 'Peri/Area']]
X_new_std = (X_new - X_train.mean()) / X_train.std()

# Predict with the final model (model_final was fitted on standardized variables)
predictions = model_final.get_prediction(X_new_std)

# Create a summary frame for the predictions
predictions_summary = predictions.summary_frame(alpha=0.05)  # 95% confidence intervals

# The model predicts standardized MAT, so transform back to °C
mat_mean = df['MAT (°C)'].mean()
mat_std = df['MAT (°C)'].std()

# Add predictions and confidence intervals back to the S6 DataFrame
S6['Predicted_MAT'] = predictions_summary['mean'] * mat_std + mat_mean
S6['Lower_CI'] = predictions_summary['mean_ci_lower'] * mat_std + mat_mean
S6['Upper_CI'] = predictions_summary['mean_ci_upper'] * mat_std + mat_mean

# Print the results for the fossil sites
for index, row in S6.iterrows():
    print(f"Fossil Site {index + 1}:")
    print(f"  Predicted Mean Annual Temperature: {row['Predicted_MAT']:.2f} °C")
    print(f"  95% Confidence Interval: ({row['Lower_CI']:.2f} °C, {row['Upper_CI']:.2f} °C)\n")