Fitting regression models

Transformations of variables

In the first part of this exercise, you have carried out multiple linear regression analysis. Today you will address three final considerations for a good regression model:

  1. Variable transformations that allow fitting linear models to non-linear relationships,
  2. Fitting non-linear models,
  3. Deciding which model is better.

Bring up the results of your analysis on leaf physiognomy and climate variables (e.g. re-run your code from Tuesday) and take another look at the results. They include the following note:

[2] The condition number is large, 6.09e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

The fine print at the bottom of the model summary mentions multicollinearity. It seems that multiple predictor variables are correlated with each other. This adds insult to injury: from the pairplot output, we already know that some variables have non-linear relationships with the Mean Annual Temperature and need a transformation. So all your conclusions could be wrong 😱
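To see where that condition number comes from, here is a minimal sketch with a small hypothetical design matrix `X` (not data from the exercise): the reported number is the ratio of the largest to the smallest singular value of the design matrix, and it blows up when columns are nearly linearly dependent.

```python
import numpy as np

# Hypothetical design matrix: an intercept, a predictor x, and a column that
# is almost exactly 2*x -- a near-linear dependency between columns
X = np.column_stack([
    np.ones(5),
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.1, 5.9, 8.2, 9.9],
])

# Condition number = largest singular value / smallest singular value
cond = np.linalg.cond(X)
print(cond)  # large: the columns are close to collinear
```

A well-conditioned matrix has a condition number near 1; values in the thousands, as in the model summary note, are a warning sign.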

Now build The Best Model: transform variables, and if some remain highly correlated with each other, check which one is the best predictor and keep only that one. As a result, you will have a model with fewer predictors and a better fit. To assess the fit, you can revisit the questions you answered last time, but for your new model:

Question 1
  1. What is the decision on the overall F-statistic for the model? Recall from the lecture what \(H_0\) is and what the alternative hypothesis is. Can we reject \(H_0\) here?
Question 2
  1. Which leaf trait is the strongest predictor of MAT?
Question 3
  1. Which leaf trait(s) can we kick out of the model? Based on what criterion?
Question 4
  1. Plot the residuals. Are they normally distributed?

Tip: you might want to investigate how the predictor variables are correlated with each other. To visualize this, you can use the matshow function from the matplotlib library. It is not required for solving this question, but it might help you understand the relationships between the variables.

Solution

In the last tutorial we found TA (cm²) and Leaf area (cm²) to have p-values greater than 0.05, suggesting they can be removed from the model. Now we turn to the remaining predictors. For this we calculate the Variance Inflation Factor (VIF) and check the correlation matrix.

import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Select predictors
X = df[['Percent untoothed*', '#teeth', 'Peri/Area', 'TA/BA']]

# Calculate VIF
vif_data = pd.DataFrame()
vif_data['Feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)

# Check the correlation matrix
correlation_matrix = df[['Percent untoothed*', '#teeth', 'Peri/Area', 'TA/BA']].corr()
print(correlation_matrix)

# Example transformation: Log transformation for TA/BA
df['TA/BA'] = np.log1p(df['TA/BA'])  # Log transformation (log(x + 1))

# Recalculate VIF after any transformations
X_transformed = df[['Percent untoothed*', '#teeth', 'Peri/Area', 'TA/BA']]
vif_data_transformed = pd.DataFrame()
vif_data_transformed['Feature'] = X_transformed.columns
vif_data_transformed['VIF'] = [variance_inflation_factor(X_transformed.values, i) for i in range(X_transformed.shape[1])]

print(vif_data_transformed)

# Check the correlation matrix
correlation_matrix = df[['Percent untoothed*', '#teeth', 'Peri/Area', 'TA/BA']].corr()
print(correlation_matrix)

#Bonus code: visualizing the correlation matrix
import matplotlib.pyplot as plt
p = plt.figure()
plt.matshow(correlation_matrix)
plt.xticks(range(correlation_matrix.shape[1]), correlation_matrix.columns, rotation=45)
plt.yticks(range(correlation_matrix.shape[1]), correlation_matrix.columns)
cb = plt.colorbar()

We see the TA/BA feature has a VIF of 5.73, which is relatively high compared to the others. This indicates that it might be somewhat correlated with other predictors. TA/BA also shows high correlation with other predictors, so we transform it.

The VIF of TA/BA remains high. Drop it and make the final model:

import statsmodels.api as sm

# Drop 'TA/BA' from the predictor variables
X = df[['Percent untoothed*', '#teeth', 'Peri/Area']]
Y = df['MAT (°C)']

# Standardize the variables
X_std = (X - X.mean()) / X.std()
y_std = (df['MAT (°C)'] - df['MAT (°C)'].mean()) / df['MAT (°C)'].std()

# Fit the OLS model with the reduced set of independent variables
model_final = sm.OLS(y_std, X_std).fit()

# Output the summary of the new model
print(model_final.summary())
  • Null hypothesis (\(H_0\)): \(\beta_1 = \beta_2 = \dots = \beta_m = 0\)
  • Alternative hypothesis (\(H_1\)): \(\beta_j \neq 0\) for at least one value of j

Because the p-value for the overall F-statistic is extremely small (\(1.66 \times 10^{-23}\)), we can reject the null hypothesis: at least one of the coefficients (\(\beta_j\)) differs from zero.

To identify the strongest predictor of Mean Annual Temperature (MAT), we need to examine the coefficients and p-values of the leaf traits in the regression model. The strongest predictor is typically the one that:

  • Is statistically significant (small p-value, typically < 0.05).
  • Has the largest absolute coefficient, indicating a strong effect on the dependent variable (MAT).

In this case, Percent untoothed is the strongest predictor of MAT because it is highly statistically significant (p < 0.001) and has the largest absolute coefficient (0.8152), meaning it has the strongest influence on MAT.

Based on the p-value criterion, we can remove predictors with p-values above 0.05, as they are not statistically significant. In this case, #teeth has a p-value greater than 0.05, suggesting it can be removed from the model.

# Drop '#teeth' from the predictor variables
X = df[['Percent untoothed*','Peri/Area']]
Y = df['MAT (°C)']

# Standardize the variables
X_std = (X - X.mean()) / X.std()
y_std = (df['MAT (°C)'] - df['MAT (°C)'].mean()) / df['MAT (°C)'].std()

# Fit the OLS model with the reduced set of independent variables
model_final = sm.OLS(y_std, X_std).fit()

# Output the summary of the new model
print(model_final.summary())

import seaborn as sns

# Calculate the residuals
residuals = model_final.resid

# Set up the matplotlib figure
plt.figure(figsize=(14, 6))

# Histogram of residuals
plt.subplot(1, 2, 1)
sns.histplot(residuals, bins=20, kde=True)
plt.title('Histogram of Residuals')
plt.xlabel('Residuals')
plt.ylabel('Frequency')

# Q-Q plot
plt.subplot(1, 2, 2)
sm.qqplot(residuals, line='s', ax=plt.gca())
plt.title('Q-Q Plot of Residuals')

plt.tight_layout()
plt.show()

Which model is better?

In the textbook (12.3.2) and the lecture, you encountered the Akaike Information Criterion (AIC). It is a measure of the relative quality of a statistical model for a given dataset. Relative is the key word here: you cannot compare models that have different response variables. But you can compare, for example, models with different numbers of predictor variables, or a linear model with a non-linear one, to see which is better.

Question 5

What was the AIC value for your original model? And what is the AIC for your model after transforming and eliminating variables? Which model has a lower AIC and is thus a better model?

Solution
original_model_aic = model_refit.aic  # Replace 'model_refit' with your actual model variable
new_model_aic = model_final.aic

# Print both AIC values
print(f"Original Model AIC: {original_model_aic}")
print(f"New Model AIC: {new_model_aic}")

# Compare the models based on AIC
if new_model_aic < original_model_aic:
    print("The new model is better (lower AIC).")
else:
    print("The original model is better (lower AIC).")  

Choosing between a linear and non-linear model

Linear regression models are the easiest to fit, so transforming variables is often the best strategy. But sometimes the variables are defined in such a way that they must follow a non-linear relationship for physical or geometric reasons. For example, leaf perimeter and area scale differently.

Let us assume that a leaf can be approximated with a circle. Then the perimeter (and thus the number of toothlets) scales as \(2 \pi r\) and the area as \(\pi r^2\). So if the predicted variable, such as MAT, is linearly dependent on one of them, it cannot be also linearly dependent on the other one. Things get much worse if you approximate a leaf with an ellipse, which is more accurate, but the calculations become much more complicated and we’ll skip them here (you can try at home if you’re curious).
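For the circle approximation, eliminating \(r\) gives the perimeter in terms of the area as \(P = 2\sqrt{\pi A}\): a quantity linear in perimeter is proportional to the square root of area, not to area itself. A quick numerical check:

```python
import numpy as np

r = np.array([1.0, 2.0, 3.0, 4.0])
area = np.pi * r**2          # A = pi * r^2
perimeter = 2 * np.pi * r    # P = 2 * pi * r

# Eliminate r: P = 2 * sqrt(pi * A), so perimeter scales with sqrt(area)
print(np.allclose(perimeter, 2 * np.sqrt(np.pi * area)))  # True
```

This is the geometric motivation for trying a square-root transformation of leaf area in the question below.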

Question 6

Transform the leaf area in the calibration dataset using the square root transformation and a natural logarithm. Which transformation gives a better linear relationship with MAT visually? And which one is best using the fit criteria you’ve learned so far?

import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots()
ax.plot(df['Leaf area (cm2)'], df['MAT (°C)'], '*')
ax.set_ylabel('MAT (°C)')
ax.set_xlabel('Leaf area (cm2)')

fig, ax = plt.subplots()
ax.plot(np.log(df['Leaf area (cm2)']), df['MAT (°C)'], 'o')
ax.set_ylabel('MAT (°C)')
ax.set_xlabel('Log of leaf area (cm2)')

fig, ax = plt.subplots()
ax.plot(np.sqrt(df['Leaf area (cm2)']), df['MAT (°C)'], 'o')
ax.set_ylabel('MAT (°C)')
ax.set_xlabel('Square root of leaf area (cm2)')
# Transform the predictor first, then add the constant: applying np.log to a
# matrix that already contains the constant column would turn the intercept
# column of ones into zeros and silently drop the intercept from the model
X_null = sm.add_constant(df['Leaf area (cm2)'])
X_log = sm.add_constant(np.log(df['Leaf area (cm2)']))
X_sqrt = sm.add_constant(np.sqrt(df['Leaf area (cm2)']))
Null_model = sm.OLS(df['MAT (°C)'], X_null).fit()
Log_model = sm.OLS(df['MAT (°C)'], X_log).fit()
Sqrt_model = sm.OLS(df['MAT (°C)'], X_sqrt).fit()
print(Null_model.aic)
print(Log_model.aic)
print(Sqrt_model.aic)

If you dig deeper, you will discover that AIC - like many other estimated quantities - is biased when calculated for small sample sizes. You can obtain a corrected value, AICc, using statsmodels; see the manual for details. The values of the parameters can be read from the summary table of each model.
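The small-sample correction adds a penalty that vanishes as \(n\) grows: \(\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}\), where \(k\) is the number of parameters including the constant. A quick check of this identity against statsmodels' implementation, using arbitrary illustrative numbers rather than values from the exercise:

```python
from statsmodels.tools.eval_measures import aic, aicc

# Arbitrary log-likelihood, sample size, and parameter count (incl. constant)
llf, nobs, k = -50.0, 30, 2

# AICc = AIC + 2k(k+1)/(n - k - 1)
manual_aicc = aic(llf, nobs, k) + 2 * k * (k + 1) / (nobs - k - 1)
print(manual_aicc, aicc(llf, nobs, k))  # the two values agree
```

With only 30 observations the correction is noticeable; with thousands it is negligible and AIC and AICc rank models identically.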

from statsmodels.tools.eval_measures import aicc
print(f'AICc for the model without transformations: {aicc(llf = Null_model.llf, nobs = Null_model.nobs, df_modelwc = 2)}')
print(f'AICc for the model with a log-transformation: {aicc(llf = Log_model.llf, nobs = Log_model.nobs, df_modelwc = 2)}')
print(f'AICc for the model with a root transformation: {aicc(llf = Sqrt_model.llf, nobs = Sqrt_model.nobs, df_modelwc = 2)}')

How does it compare with \(\bar{R}^2\)?

print(f'R2adj for the model without transformations: {Null_model.rsquared_adj}')
print(f'R2adj for the model with a log-transformation: {Log_model.rsquared_adj}')
print(f'R2adj for the model with a root transformation: {Sqrt_model.rsquared_adj}')
fig, ax = plt.subplots()
ax.plot(df['Leaf area (cm2)'], df['MAT (°C)'], '*')
ax.plot(df['Leaf area (cm2)'], Null_model.fittedvalues, '-', color='#00000080')
ax.set_xlabel('Leaf area (cm^2)')
ax.set_ylabel('MAT (°C)')
fig, ax = plt.subplots()
ax.plot(np.log(df['Leaf area (cm2)']), df['MAT (°C)'], 'o')
ax.plot(np.log(df['Leaf area (cm2)']), Log_model.fittedvalues, '-', color='#00000080')
ax.set_xlabel('Log Leaf area (cm^2)')
ax.set_ylabel('MAT (°C)')
fig, ax = plt.subplots()
ax.plot(np.sqrt(df['Leaf area (cm2)']), df['MAT (°C)'], 'o')
ax.plot(np.sqrt(df['Leaf area (cm2)']), Sqrt_model.fittedvalues, '-', color='#00000080')
ax.set_xlabel('Sqrt Leaf area (cm^2)')
ax.set_ylabel('MAT (°C)')

Why do the AIC and \(\bar{R}^2\) indicate different best models?

Sometimes we are not happy with the regression model obtained using transformations. In other cases, we want to predict values outside of those represented in our training dataset. For example, very hot temperatures that might have occurred in the past but are not represented in the leaf assemblages in our calibration dataset. In such cases, we might be truly interested in a non-linear model.

The details of the computation follow the convention in the textbook, chapter 12.3.2. statsmodels can fit a non-linear model as long as the model is linear in its parameters. If you want to know more, please see the example in the manual.

M_sqrt = np.column_stack((np.ones_like(df['Leaf area (cm2)']), df['Leaf area (cm2)'], np.sqrt(df['Leaf area (cm2)'])) )
Sqrt_nonlinear = sm.OLS(df['MAT (°C)'], M_sqrt).fit()
fig, ax = plt.subplots()
ax.plot(M_sqrt[:,2], df['MAT (°C)'], 'o')
ax.plot(M_sqrt[:,2], Sqrt_nonlinear.fittedvalues, '*', color='#00000080')
ax.set_xlabel('Sqrt Leaf area (cm^2)')
ax.set_ylabel('MAT (°C)')
M_log = np.column_stack((np.ones_like(df['Leaf area (cm2)']), df['Leaf area (cm2)'], np.log(df['Leaf area (cm2)'])) )
Log_nonlinear = sm.OLS(df['MAT (°C)'], M_log).fit()
fig, ax = plt.subplots()
ax.plot(M_log[:,2], df['MAT (°C)'], 'o')
ax.plot(M_log[:,2], Log_nonlinear.fittedvalues, '*', color='#00000080')
ax.set_xlabel('Log Leaf area (cm^2)')
ax.set_ylabel('MAT (°C)')

Model uncertainty

Remember that what you are doing in this exercise is developing a proxy. Your model will be applied to fossil leaves to reconstruct Mean Annual Temperature in the past to fuel palaeoclimate models.

In practical 10, you have learned how to calculate the confidence intervals (CI) and belt for the model. Look back into your code and apply it to your new model. First get the confidence intervals for your predictor variables. Then calculate the prediction interval.

Question 7

Using confidence intervals, which predictor variables do not have a significant effect on MAT?

Solution

Both predictor variables Percent untoothed and Peri/Area have confidence intervals that do not include zero. This suggests that both predictors have a statistically significant effect on Mean Annual Temperature (MAT) based on the model.

The predictive power of a regression model

Congratulations, you have just developed a palaeoclimate proxy. Now you can reconstruct mean annual temperature for fossil sites where we have leaf assemblages but no thermometer record. The same article from which your calibration dataset comes also includes data from five fossil sites, for which leaf parameters have been collected. You can use the model you created (i.e. the best model you selected) to reconstruct the Mean Annual Temperature for each of these sites.

Read the dataset in:

pathS6 = '../Data/Peppe_et_al_2011_S6.csv'
S6 = pd.read_csv(pathS6, sep=',')

Unfortunately, some variables have slightly different names. You can practice your pandas-fu to unify the names.

Question 8

What was the mean annual temperature and its uncertainty for each of the fossil sites? What measures of uncertainty would you need to report when describing these reconstructions?

Solution
# Rename columns to match your regression model
name_mapping = {
    'Percent untoothed': 'Percent untoothed*',  # Renaming for consistency
    '#teeth': '#teeth',  # Keeping the same
    'Tooth area (cm2)': 'Peri/Area',  # Rename to match your model
    # Add any other necessary renaming
}

# Apply renaming
S6.rename(columns=name_mapping, inplace=True)

# Standardize the fossil-site predictors using the calibration dataset's
# mean and standard deviation, so they are on the scale the model was fitted on
X_train = df[['Percent untoothed*', 'Peri/Area']]
X_new = S6[['Percent untoothed*', 'Peri/Area']]
X_new_std = (X_new - X_train.mean()) / X_train.std()

# Predict with the final model (model_final was fitted on standardized variables)
predictions = model_final.get_prediction(X_new_std)

# Create a summary frame for the predictions
predictions_summary = predictions.summary_frame(alpha=0.05)  # 95% confidence intervals

# The model predicts standardized MAT, so transform back to °C
mat_mean = df['MAT (°C)'].mean()
mat_std = df['MAT (°C)'].std()

# Add predictions and confidence intervals back to the S6 DataFrame
S6['Predicted_MAT'] = predictions_summary['mean'] * mat_std + mat_mean
S6['Lower_CI'] = predictions_summary['mean_ci_lower'] * mat_std + mat_mean
S6['Upper_CI'] = predictions_summary['mean_ci_upper'] * mat_std + mat_mean

# Print the results for the fossil sites
for index, row in S6.iterrows():
    print(f"Fossil Site {index + 1}:")
    print(f"  Predicted Mean Annual Temperature: {row['Predicted_MAT']:.2f} °C")
    print(f"  95% Confidence Interval: ({row['Lower_CI']:.2f} °C, {row['Upper_CI']:.2f} °C)\n")