import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pollen = {'Birch': [10, 10, 10, 10, 10, 30],
          'Pine':  [25, 20, 15, 10, 5, 1],
          'Grass': [5, 20, 5, 5, 5, 5]}
df = pd.DataFrame(data=pollen)

Compositional data
==================
In many cases in geoscience we have to deal with compositional data. The chemical compositions of rocks, the pebble compositions of glacial tills and the sediment compositions of lakes are usually treated as percentages. The same is true for the abundance of fossils identified in different sediment layers. Although fossils can be counted in absolute terms, the concentration of fossils in a sediment layer may be difficult to assess or may not be meaningful for various reasons. If we want to compare the composition of samples in such situations, it is often necessary to calculate percentages: we sum all the finds of the different fossil types and express the abundance of each type as a proportion of that sum. This has the disadvantage that if one type becomes less frequent in a sample, the others will appear more frequent whether or not they are actually more abundant. We will have a look at that in a hypothetical example.
Let’s assume we have obtained a sediment core from a Dutch pingo lake and want to study the fossil pollen preserved in the sediment, which provides information on the past vegetation around the site. We quickly processed 6 samples from throughout the core to see whether the vegetation changed during the time represented by the core. Only 3 pollen types were counted:
We see the data in the table but to get a better visual impression let’s use a line plot.
sns.set()
sns.lineplot(data = df[['Birch','Pine', 'Grass']])
plt.title("Pollen content")
plt.ylabel('Number of grains counted')

The absolute counts of fossil pollen differ considerably between the three pollen types. This makes it difficult to compare the count numbers between samples with different count totals:
df.sum(axis=1)

Let us calculate the percentages and look at them.
df['percBirch'] = 100 * df['Birch'] / (df['Birch'] + df['Pine'] + df['Grass'])
df['percPine'] = 100 * df['Pine'] / (df['Birch'] + df['Pine'] + df['Grass'])
df['percGrass'] = 100 * df['Grass'] / (df['Birch'] + df['Pine'] + df['Grass'])
sns.lineplot(data = df[['percBirch','percPine', 'percGrass']])
plt.title("Pollen percent")
plt.ylabel('Pollen content in %')

Which changes do you see between the counts and the percentages? Briefly describe them in words. Did the rank order abundance change in any of the samples?
Solution
The general declining trend of Pine in the counts is also visible in the percentages, as is the peak of Grass in sample 1. The equal Birch counts in samples 0 to 4, however, translate into percentages that rise between samples 1 and 4. The rank order abundance did not change in any sample, meaning the relative position of each variable within a sample stayed the same.
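The rank-order statement can be checked with a small sketch (assuming the pollen counts from above; `DataFrame.rank` assigns rank 1 to the smallest value within each row):

```python
import pandas as pd

# Rebuild the hypothetical pollen counts from above
pollen = {'Birch': [10, 10, 10, 10, 10, 30],
          'Pine':  [25, 20, 15, 10, 5, 1],
          'Grass': [5, 20, 5, 5, 5, 5]}
df = pd.DataFrame(pollen)

# Percentages: divide each row by its row total
perc = 100 * df.div(df.sum(axis=1), axis=0)

# Rank the variables within each sample (row); identical rank tables
# mean the percentage transformation preserved the rank order abundance
count_ranks = df.rank(axis=1)
perc_ranks = perc.rank(axis=1)
print(count_ranks.equals(perc_ranks))  # → True
```

Because the percentages divide every value in a row by the same total, the within-sample rank order is always preserved by this transformation.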
We can quickly generate some x-y scatter plots to investigate relationships between the variables.
sns.pairplot(df,
x_vars = ['percBirch','percPine','percGrass'],
y_vars = ['percBirch','percPine','percGrass'],
height=2, aspect=1)

There appear to be some correlations between the variables. Are they real? Create these scatter plots also for the initial count data.
Solution
sns.pairplot(df,
x_vars = ['Birch','Pine','Grass'],
y_vars = ['Birch','Pine','Grass'],
height=2, aspect=1)

In a real situation, where independent concentrations are not available, it would be difficult to answer the question of whether the apparent correlations in the percentage data are real. In this example, however, we see that even when there are no correlations in the counts, the percentage representation of the changes in abundance introduces apparent relationships between the variables. Due to the closed sum, the decline of one variable must be met by an increase in another variable even if the absolute quantity of that variable did not change (e.g. the decline of Pine is mirrored by rising Birch percentages in samples 0 to 4, although the Birch counts did not change).
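The same point can be made quantitatively by comparing the correlation matrices of the counts and of the percentages (a sketch assuming the example data from above):

```python
import pandas as pd

# Rebuild the example data
pollen = {'Birch': [10, 10, 10, 10, 10, 30],
          'Pine':  [25, 20, 15, 10, 5, 1],
          'Grass': [5, 20, 5, 5, 5, 5]}
df = pd.DataFrame(pollen)
perc = 100 * df.div(df.sum(axis=1), axis=0)

# Pearson correlations between the variables: note how closure
# shifts the Birch-Pine correlation in the percentage data
print(df.corr().round(2))    # counts
print(perc.corr().round(2))  # percentages
```

In this data set the Birch–Pine correlation is more strongly negative in the percentages than in the counts, which is exactly the closure artefact discussed above.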
These spurious correlations and a number of other effects are due to the “closure” of the data. Closed data, or closure, means that all observations sum to a constant ‘κ’ and each individual proportion falls between 0 and ‘κ’. The individual proportions are therefore generally not independent of one another, which strictly requires special transformations. Although the closure effect also has consequences for multivariate analysis, we will not take measures against it here, but we need to be aware of it.
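The closure itself is easy to demonstrate (a sketch with the example data; here κ = 100 because we multiplied the proportions by 100):

```python
import pandas as pd

pollen = {'Birch': [10, 10, 10, 10, 10, 30],
          'Pine':  [25, 20, 15, 10, 5, 1],
          'Grass': [5, 20, 5, 5, 5, 5]}
df = pd.DataFrame(pollen)
perc = 100 * df.div(df.sum(axis=1), axis=0)

# Every sample now sums to the constant κ = 100 ...
row_sums = perc.sum(axis=1)
print(row_sums)

# ... and each individual proportion is bounded by 0 and κ
print(((perc >= 0) & (perc <= 100)).all().all())  # → True
```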
Assessing differences between multivariate samples
==================================================
We stay with the example data we created above for a moment longer and will quantify the differences between the samples. As each sample is composed of three variables we cannot compare them directly but need a distance measure. First, however, we will create a clean data set that only contains the percentage data (and not the original counts) and use this for our analysis:
pol_perc = df[['percBirch','percPine','percGrass']]

Do you remember the equation for the Euclidean distance? It is the square root of the sum of the squared differences between the pairs of variables in the two samples to be compared:

d(x, y) = sqrt( (x1-y1)^2 + (x2-y2)^2 + (x3-y3)^2 )

Please calculate the Euclidean distance between sample 0 and sample 1 in Python in small steps and provide the code.
HINT: You may want to use the pandas.DataFrame.diff function; the answer should be 35.88175.
Solution
There are surely different ways of doing this. I choose to first calculate the percentage differences of the variables between consecutive samples. Using the index, I extract only the difference between sample 0 and sample 1 into a variable that I call d_12. In a second step I square the numbers. Finally, I sum the squares and take the square root of the sum in one step.
d_12 = pol_perc.diff(periods=1, axis=0)[1:2]
d_12 = d_12**2
d_12.sum(axis=1)**0.5

So hopefully you just calculated the distance between two samples. To get the full picture you would need to repeat that calculation for all pairs of samples, which would be a lot of work, and of course there is a function in Python that can do that for you.
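Before using it, here is a one-line cross-check of the manual result with NumPy (a sketch that rebuilds pol_perc; the norm of the difference vector is exactly the Euclidean distance):

```python
import numpy as np
import pandas as pd

# Rebuild the percentage data from above
pollen = {'Birch': [10, 10, 10, 10, 10, 30],
          'Pine':  [25, 20, 15, 10, 5, 1],
          'Grass': [5, 20, 5, 5, 5, 5]}
df = pd.DataFrame(pollen)
for col in ['Birch', 'Pine', 'Grass']:
    df['perc' + col] = 100 * df[col] / df[['Birch', 'Pine', 'Grass']].sum(axis=1)
pol_perc = df[['percBirch', 'percPine', 'percGrass']]

# Euclidean distance between sample 0 and sample 1
dist = np.linalg.norm(pol_perc.iloc[0] - pol_perc.iloc[1])
print(round(dist, 5))  # → 35.88175
```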
import scipy.spatial.distance as distance
d = distance.pdist(pol_perc,'euclidean')
d = distance.squareform(d)
d

Explore the resulting distance matrix and find which two samples are most similar and which sample is most different from the other samples.
Solution
The third (2) and the fourth (3) samples are most similar; the largest distance in the matrix is between the first (0) and the last (5) sample, and sample 5 stands out as the most different from all the other samples.
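The same answer can be read off programmatically (a sketch assuming the percentage data from above): mask the zero diagonal and look up the closest and the most distant pair of samples.

```python
import numpy as np
import pandas as pd
import scipy.spatial.distance as distance

# Rebuild the percentage data from above
pollen = {'Birch': [10, 10, 10, 10, 10, 30],
          'Pine':  [25, 20, 15, 10, 5, 1],
          'Grass': [5, 20, 5, 5, 5, 5]}
df = pd.DataFrame(pollen)
perc = 100 * df.div(df.sum(axis=1), axis=0)

D = distance.squareform(distance.pdist(perc, 'euclidean'))
np.fill_diagonal(D, np.nan)  # ignore the zero self-distances
closest = np.unravel_index(np.nanargmin(D), D.shape)
farthest = np.unravel_index(np.nanargmax(D), D.shape)
print(closest, farthest)  # → (2, 3) (0, 5)
```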
Explore other available distances in scipy.spatial.distance and list at least 3 that find a largest distance for a different pair of samples compared to the euclidean distance. Find at least 3 distances that are not informative in the current case and state why.
Solution
Among others, the braycurtis, canberra and cityblock distances find the largest difference between the second (1) and the last (5) sample. Distances that evaluate presence/absence or Boolean data are not informative in our example, as our observations consist of abundance changes. Among others, hamming, jaccard and matching are not useful here.
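To compare metrics quickly, a small loop (a sketch with the same percentage data) can report which pair of samples each metric finds most distant:

```python
import numpy as np
import pandas as pd
import scipy.spatial.distance as distance

# Rebuild the percentage data from above
pollen = {'Birch': [10, 10, 10, 10, 10, 30],
          'Pine':  [25, 20, 15, 10, 5, 1],
          'Grass': [5, 20, 5, 5, 5, 5]}
df = pd.DataFrame(pollen)
perc = 100 * df.div(df.sum(axis=1), axis=0)

results = {}
for metric in ['euclidean', 'braycurtis', 'canberra', 'cityblock']:
    D = distance.squareform(distance.pdist(perc, metric))
    # argmax on the symmetric matrix returns the upper-triangle entry first
    pair = np.unravel_index(np.argmax(D), D.shape)
    results[metric] = tuple(int(i) for i in pair)
    print(metric, results[metric])
```

With this data the euclidean metric singles out samples 0 and 5, while the other three metrics agree on samples 1 and 5.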
You have familiarized yourself with the different distance measures. Let’s come back to the Euclidean distance and work out the linkage. Open the distance matrix again and draw a dendrogram by hand with the linkage distances on the y-axis. Provide the order of samples that you linked.
Solution
As seen before, the samples with the IDs 2 and 3 are linked first, followed by 0, 4, 1, and 5.
Now compute the linkages and look at the results:
from scipy.cluster.hierarchy import linkage
Z = linkage(pol_perc,'single')
Z

Look at the result! Columns one and two tell us which two samples were linked, and column three gives the distance for that linkage. In the case of single linkage these distances are usually the same as those in the distance matrix that you looked at before. We know that samples 2 and 3 are most similar and see that they were linked at their distance. The combined sample is given the next available new label (6). The next link is made between this new group (6) and sample 0, at the distance between sample 0 and sample 2.
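For readability, the linkage table can be wrapped in a labelled DataFrame (a sketch that rebuilds Z from the percentage data; the column names are my own labels for scipy's documented column meanings):

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage

# Rebuild the percentage data and the single-linkage result from above
pollen = {'Birch': [10, 10, 10, 10, 10, 30],
          'Pine':  [25, 20, 15, 10, 5, 1],
          'Grass': [5, 20, 5, 5, 5, 5]}
df = pd.DataFrame(pollen)
perc = 100 * df.div(df.sum(axis=1), axis=0)

Z = linkage(perc, 'single')
# cluster_1 / cluster_2: the two linked labels, distance: linkage distance,
# n_samples: number of original samples in the new cluster
Z_table = pd.DataFrame(Z, columns=['cluster_1', 'cluster_2', 'distance', 'n_samples'])
print(Z_table)
```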
You can now, of course, also plot the dendrogram and check if your drawing was OK.
from scipy.cluster.hierarchy import dendrogram
plt.figure(figsize=(8, 6))
dendrogram(Z)
plt.xlabel('Sample ID')
plt.ylabel('Distance')
plt.show()

Now compute and plot a dendrogram using complete linkage and describe the differences. What is indicated by the 4th column of the linkage table Z? Why are the distances (y-axis) different compared to the results obtained with single linkage?
Solution
Complete linkage uses the largest distance from any sample in one group to any sample in the other group. The two samples that are most similar (2 and 3) are grouped in the same way; now, however, the distance between samples 4 and 5 is smaller than the largest distance from the emerging cluster to sample 5, so 4 and 5 form their own cluster. As the algorithm always looks at the largest distance, the linking distances on the y-axis are much larger than with single linkage. The 4th column of the linkage table Z indicates the number of samples contained in the newly formed cluster.
Z = linkage(pol_perc, 'complete')
Z
plt.figure(figsize=(8, 6))
dendrogram(Z)
plt.xlabel('Sample ID')
plt.ylabel('Distance')
plt.show()
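As a possible next step (a sketch with the same example data), the complete-linkage tree can be cut into a fixed number of groups with scipy's fcluster:

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Rebuild the percentage data and the complete-linkage result
pollen = {'Birch': [10, 10, 10, 10, 10, 30],
          'Pine':  [25, 20, 15, 10, 5, 1],
          'Grass': [5, 20, 5, 5, 5, 5]}
df = pd.DataFrame(pollen)
perc = 100 * df.div(df.sum(axis=1), axis=0)

Z = linkage(perc, 'complete')
# cut the tree so that exactly two clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # samples 4 and 5 end up in their own cluster
```

Cutting at two clusters separates samples 4 and 5 from the rest, matching the dendrogram discussed above.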