import pandas as pd
import statsmodels.api as sm
import seaborn as sns
Multiple regression
In this exercise we will use linear regression to develop a proxy. A proxy is a variable that is easier to measure than the one we are actually interested in. For example, we cannot measure the temperature of the Earth’s surface in the past, but we can reconstruct it using proxies: e.g. the oxygen isotopic composition of marine microfossils, or the adaptations exhibited by plants, whose functional morphology reflects the temperature regime in which they grew.
Why is this a good case for using multiple regression? Because no single leaf trait can predict climate perfectly. But if we combine several leaf traits, we might get a better prediction. This is the essence of multiple regression: we are looking for the best combination of predictors.
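Formally, the model we are fitting expresses the response as a linear combination of several predictors:

\(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_m x_m + \varepsilon\)

where \(y\) is the climate variable we want to reconstruct, the \(x_j\) are the measured leaf traits, the \(\beta_j\) are the coefficients to be estimated, and \(\varepsilon\) is the error term.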
Reconstructing climate parameters from leaf physiognomy (= shape) data is known as CLAMP (Climate Leaf Analysis Multivariate Program). It is widely used, but when you first encounter it, you may be very suspicious: leaves are of course shaped by many things: how high they are on a tree, how much light they get, how much water they get, etc. And some plants develop differently shaped leaves at different stages of their life cycle or of the season. However, the stakes are high: if the method is reliable, we can apply it to fossil environments. There are so-called Fossillagerstätten, where we find not one or two fossils but an entire assemblage, with all types of leaves present in the ecosystem. With CLAMP, we could reconstruct the climate precisely, even if these plants are extinct today and we don’t know their ecological preferences. So if we can reliably calibrate a proxy, it will be a powerful tool in palaeoclimatology.
Fortunately, researchers have collected large datasets suitable for such a calibration. We will use one collected by Peppe et al. (2011). In this exercise, we use their table S1, “Climate variables and site-mean physiognomic variables for all calibration sites”.
Import the dataset from the file extracted from the article:
path = '../Data/Peppe_et_al_2011_S1.csv'
df = pd.read_csv(path, sep=',')
df.head()
The dataset contains a variety of variables. A complete explanation can be found elsewhere; here let us focus on the most important, and easy to grasp, ones:
Leaf physiognomy variables
- Percent untoothed - what proportion of taxa has untoothed (entire-margined) leaves
- #teeth - number of teeth
- Leaf area [\(cm^2\)]
- TA [\(cm^2\)] - tooth area
- Peri/Area - the ratio of the perimeter to the area [\(cm^{-1}\)], in other words: the complexity of the leaf perimeter
- TA/BA - ratio of tooth area to blade area
In fact there are many more variables; here we just select a few that are easy to grasp. You can learn more about the power of this analysis and its physical and biological underpinnings in the Palaeobiology course or at https://www.digitalatlasofancientlife.org/learn/paleoecology/paleoclimate/dilp/.
Which of the many parameters of leaf shape is a good predictor of climate parameters? We could ask this question for two reasons:
- Measuring leaf parameters is labour-intensive. We only want to measure variables that are really informative.
- The relationship between climate and leaf physiognomy helps us understand the functional morphology of the plant: e.g. how does it exchange water and \(CO_2\) through its leaves? Which shapes are optimized for preventing water loss and which for \(CO_2\) acquisition?
Always plot your data
We can use the code from the textbook (11.4) to visualize the variables:
sns.pairplot(df[['MAT (°C)', 'Percent untoothed*', 'TA (cm2)', '#teeth','Leaf area (cm2)','Peri/Area','TA/BA']])
Some of these variables are clearly not linearly related to MAT. And some of them have skewed distributions, so transformations might be necessary.
Visualizing large datasets with this method is difficult. Even less can be seen on 3D plots, so although they are possible, they are not recommended (see Fig. 12.2 in the textbook). Next week you will learn more about dealing with high-dimensional data.
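When a pairplot grows too crowded, a correlation heatmap condenses the pairwise relationships into a single panel. A minimal sketch with synthetic data (the column names and effect sizes below are invented for illustration, not taken from the Peppe et al. table):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for a leaf-trait table (illustrative columns only)
rng = np.random.default_rng(42)
n = 100
temp = rng.uniform(0, 30, n)
demo = pd.DataFrame({
    'MAT': temp,
    'untoothed': 0.03 * temp + rng.normal(0, 0.2, n),   # rises with MAT
    'peri_area': -0.05 * temp + rng.normal(0, 0.3, n),  # falls with MAT
})

corr = demo.corr()  # pairwise Pearson correlations
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation matrix (synthetic data)')
plt.show()
```

Only linear association shows up here, so the heatmap complements, rather than replaces, the pairplot when relationships may be curved.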
Let’s pretend we’re careless: we skip this visualization, assume the variables are fine, and go straight to the regression analysis:
Predicting climate from leaf shape
This is the proxy calibration part. We want to know which leaf shape parameters are good predictors of climate and what the model is.
# Gather predictor values in X
X = df[['Percent untoothed*', 'TA (cm2)','#teeth','Leaf area (cm2)','Peri/Area','TA/BA']]
Let’s say we are interested in Mean Annual Temperature. It turns out we can use exactly the same syntax as for univariate regression.
X = sm.add_constant(X)
model = sm.OLS(df['MAT (°C)'], X).fit()
summary = model.summary()
print(summary)
Questions
What is the decision on the overall F-statistic for the model? From the lecture recall what \(H_0\) is and what the alternative hypothesis is. Can we reject \(H_0\) here?
Solution
Q1: From the lecture:
- Null hypothesis (\(H_0\)): \(\beta_1 = \beta_2 = \dots = \beta_m = 0\)
- Alternative hypothesis (\(H_1\)): \(\beta_j \neq 0\) for at least one value of j
Because the p-value for the overall F-statistic is extremely small (below \(10^{-20}\)), we can reject the null hypothesis and conclude that at least one coefficient \(\beta_j\) differs from 0.
Which leaf trait is the strongest predictor of MAT?
Solution
Q2:
To identify the strongest predictor of Mean Annual Temperature (MAT), we need to examine the coefficients and p-values of the leaf traits in the regression model. The strongest predictor is typically the one that:
- Is statistically significant (small p-value, typically < 0.05).
- Has the largest absolute coefficient, indicating a strong effect on the dependent variable (MAT). Note that raw coefficients are only directly comparable when the predictors are on similar scales; standardized coefficients (see below) make this comparison fair.
In this case, Peri/Area is the strongest predictor of MAT because it is highly statistically significant (p < 0.001) and has the largest absolute coefficient (approximately -1.2), meaning it has the strongest influence on MAT.
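This comparison can be done programmatically: `model.params` and `model.pvalues` hold the coefficients and p-values, so we can filter for significance and sort by absolute size. A sketch on synthetic data (variable names invented; here all predictors are on the same scale, so the raw coefficients are comparable):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
demo = pd.DataFrame({
    'a': rng.normal(size=n),
    'b': rng.normal(size=n),
    'c': rng.normal(size=n),
})
y = 2.0 * demo['a'] + 0.5 * demo['b'] + rng.normal(size=n)  # 'c' is pure noise

fit = sm.OLS(y, sm.add_constant(demo)).fit()

coefs = fit.params.drop('const')
pvals = fit.pvalues.drop('const')
significant = coefs[pvals < 0.05]                # keep significant terms only
ranked = significant.abs().sort_values(ascending=False)
print(ranked)                                    # strongest predictor first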
Which leaf trait(s) can we kick out from the model? Based on what criterion?
Solution
Q3:
Based on the p-value criterion, we can remove predictors that have p-values above 0.05, as they are not statistically significant. In this case, TA (cm²) and Leaf area (cm²) have p-values greater than 0.05, suggesting they can be removed from the model.
Calculate standardized partial regression coefficients.
Solution
Q4:
To calculate standardized partial regression coefficients, we standardize both the independent and dependent variables before fitting the model. Here’s the Python code to do that:
X = df[['Percent untoothed*', '#teeth', 'Peri/Area', 'TA/BA']]
Y = df['MAT (°C)']
# Standardize the variables
X_std = (X - X.mean()) / X.std()
y_std = (df['MAT (°C)'] - df['MAT (°C)'].mean()) / df['MAT (°C)'].std()
# Fit the OLS model with the reduced set of independent variables
model_refit = sm.OLS(y_std, X_std).fit()
# Output the summary of the refitted model
print(model_refit.summary())Plot the residuals. Are they normally distributed?
Solution
Q5:
To assess whether the residuals are normally distributed, we can plot a histogram and a Q-Q plot of the residuals.
import matplotlib.pyplot as plt
# Calculate the residuals
residuals = model_refit.resid
# Set up the matplotlib figure
plt.figure(figsize=(14, 6))
# Histogram of residuals
plt.subplot(1, 2, 1)
sns.histplot(residuals, bins=20, kde=True)
plt.title('Histogram of Residuals')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
# Q-Q plot
plt.subplot(1, 2, 2)
sm.qqplot(residuals, line='s', ax=plt.gca())
plt.title('Q-Q Plot of Residuals')
plt.tight_layout()
plt.show()Based on the histogram and the QQ plot we can conclude that the data is normally distributed.
References
Peppe, D. J., Royer, D. L., Cariglino, B., Oliver, S. Y., Newman, S., Leight, E., … & Enikolopov, G. (2011). Sensitivity of leaf size and shape to climate: global patterns and paleoclimatic applications. New Phytologist, 190(3), 724-739.