Multiple regression

In this exercise we will use linear regression for developing a proxy. A proxy is a variable that is easier to measure than the one we are actually interested in. For example, we cannot measure the temperature of the Earth’s surface in the past, but we can reconstruct it using proxies: e.g. the oxygen isotopic composition of marine microfossils or the adaptations exhibited by plants, which reflect the functional morphology of how the plant functions in a given temperature regime.

Why is this a good case for using multiple regression? Because no single leaf trait can predict climate perfectly. But if we combine several leaf traits, we might get a better prediction. This is the essence of multiple regression: we are looking for the best combination of predictors.

Reconstructing climate parameters from leaf physiognomy (=shape) data is known as CLAMP (Climate Leaf Analysis Multivariate Program). It is widely used, but when you first encounter it, you may be very suspicious: leaves are of course shaped by many things, how high they are on a tree, how much light they get, how much water they get, etc. And some plants develop differently shaped leaves at different stages of their life cycle or of the season. However, the stake is high: should the method be reliable, we could apply it to fossil environments. There are so called Fossillagerstätten, where we find not one or two fossils, but an entire assemblage with all types of leaves present in the ecosystem. With CLAMP, we could reconstruct the climate precisely, even if these plants are extinct today and we don’t know their ecological preferences. So if we can reliably calibrate a proxy, it will be a powerful tool in palaeoclimatology.

Fortunately, researchers have collected large datasets suitable for such a calibration. We will use one collected by Peppe et al. (2011). In this exercise, we use their table S1, “Climate variables and site-mean physiognomic variables for all calibration sites”

import pandas as pd
import statsmodels.api as sm
import seaborn as sns

Import the dataset from the file extracted from the article:

path = '../Data/Peppe_et_al_2011_S1.csv'
df = pd.read_csv(path, sep=',')
df.head()

This is a variety of data. A complete explanation you can find elsewhere, but let us focus on the most important - and easy to grasp - variables:

Leaf physiognomy variables

  • Percent untoothed - what proportion of taxa has serrated leaves
  • #teeth - number of teeth
  • Leaf area [\(cm^2\)]
  • TA [\(cm^2\)] - tooth area
  • Peri/Area - the ratio of the perimeter to the area [\(cm\)], in other words: the complexity of the leaf perimeter
  • TA/BA - ratio of tooth area to blade area

In fact there are many more variables. Here we just select a few that are easy to grasp. You can learn more about the power of this analysis and its physical and biological underpinning in the Palaeobiology course or in the https://www.digitalatlasofancientlife.org/learn/paleoecology/paleoclimate/dilp/.

Which of the many parameters of leaf shape is a good predictor of climate parameters? We could ask this question for two reasons:

  1. Measuring leaf parameters is labour intensive. We only want to measure variables which are really informative.

  2. The relationship between climate and leaf physiognomy helps us understand the functional morphology of the plant: e.g. how does it exchange water and \(CO_2\) through its leaves? What shapes are optimized for preventing water loss and which ones are optimized for \(CO_2\) acquistion?

Always plot your data

We can use the code from the textbook (11.4) to visualize the variables:

sns.pairplot(df[['MAT (°C)', 'Percent untoothed*', 'TA (cm2)', '#teeth','Leaf area (cm2)','Peri/Area','TA/BA']])

Some of these variables are clearly not linearly related to MAT. And some of them have skewed distributions, so transformations might be necessary.

Visualizing large datasets with this method is difficult. Even less can be seen on 3D plots, so although they are possible, they are not recommended (see Fig. 12.2 in the textbook). Next week you will learn more about dealing with high-dimensional data.

Let’s pretend we’re careless: we haven’t done this visualization, assumed that the variables are ok, and went straight to the regression analysis:

Predicting climate from leaf shape

This is the proxy calibration part. We want to know which leaf shape parameters are good predictors of climate and what the model is.

# Gather predictor values in X
X = df[['Percent untoothed*', 'TA (cm2)','#teeth','Leaf area (cm2)','Peri/Area','TA/BA']]

Let’s say we are interested in Mean Annual Temperature. It turns out we can use exactly the same syntax as for univariate regression.

X = sm.add_constant(X)
model = sm.OLS(df['MAT (°C)'], X).fit()
summary = model.summary()
print(summary)

Questions

TipQuestion 1

What is the decision on the overall F-statistic for the model? From the lecture recall what \(H_0\) is and what the alternative hypothesis is. Can we reject \(H_0\) here?

TipQuestion 2

Which leaf trait is the strongest predictor of MAT?

TipQuestion 3

Which leaf trait(s) can we kick out from the model? Based on what criterion?

TipQuestion 4

Calculate standardized partial regression coefficients.

TipQuestion 5

Plot the residuals. Are they normally distributed?

References

Peppe, D. J., Royer, D. L., Cariglino, B., Oliver, S. Y., Newman, S., Leight, E., … & Enikolopov, G. (2011). Sensitivity of leaf size and shape to climate: global patterns and paleoclimatic applications. New Phytologist, 190(3), 724-739.