import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pollen = {'Birch': [10, 10, 10, 10, 10, 30],
          'Pine':  [25, 20, 15, 10, 5, 1],
          'Grass': [5, 20, 5, 5, 5, 5]}
df = pd.DataFrame(data=pollen)

Compositional data
==================
In many cases in geoscience we have to deal with compositional data. The chemical compositions of rocks, the pebble compositions of glacial tills and the sediment compositions of lakes are usually treated as percentages. The same is true for the abundance of fossils identified in different sediment layers. Although fossils can be counted in absolute terms, the concentration of fossils in a sediment layer may be difficult to assess or may not be meaningful for various reasons. If we want to compare the composition of samples in such situations, it is often necessary to calculate percentages: we sum all the finds of the different fossil types and express the abundance of each type as a proportion of that sum. This has the disadvantage that if one type becomes less frequent in a sample, the others will appear more frequent whether or not they are actually more abundant. We will have a look at that in a hypothetical example.
Let’s assume we have obtained a sediment core from a Dutch pingo lake and want to study the fossil pollen preserved in the sediment, which provides information on the past vegetation around the site. We quickly processed 6 samples from throughout the core to see whether the vegetation changed during the time represented by the core. Only 3 pollen types were counted:
We see the data in the table but to get a better visual impression let’s use a line plot.
sns.set()
sns.lineplot(data = df[['Birch','Pine', 'Grass']])
plt.title("Pollen content")
plt.ylabel('Number of grains counted')

The absolute counts of fossil pollen differ considerably between the three pollen types. This makes it difficult to compare the count numbers between samples with different count totals:
df.sum(axis=1)

Let us calculate the percentages and look at them.
df['percBirch'] = 100 * df['Birch'] / (df['Birch'] + df['Pine'] + df['Grass'])
df['percPine'] = 100 * df['Pine'] / (df['Birch'] + df['Pine'] + df['Grass'])
df['percGrass'] = 100 * df['Grass'] / (df['Birch'] + df['Pine'] + df['Grass'])
sns.lineplot(data = df[['percBirch','percPine', 'percGrass']])
plt.title("Pollen percent")
plt.ylabel('Pollen content in %')

Which changes do you see between the counts and the percentages? Briefly describe them in words. Did the rank order abundance change in any of the samples?
Solution
The general declining trend of Pine in the counts is also visible in the percentages, as is the peak of Grass in sample 1. The equal Birch counts in samples 0 to 4, however, translate into percentages that rise between samples 1 and 4. The rank order abundance did not change in any sample, meaning the relative position of each variable within a sample stayed the same.
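The rank-order statement can be checked with a small sketch (assuming the pollen counts from above; `DataFrame.rank` assigns rank 1 to the smallest value within each row):

```python
import pandas as pd

# Rebuild the hypothetical pollen counts from above
pollen = {'Birch': [10, 10, 10, 10, 10, 30],
          'Pine':  [25, 20, 15, 10, 5, 1],
          'Grass': [5, 20, 5, 5, 5, 5]}
df = pd.DataFrame(pollen)

# Percentages: divide each row by its row total
perc = 100 * df.div(df.sum(axis=1), axis=0)

# Rank the variables within each sample (row); identical rank tables
# mean the percentage transformation preserved the rank order abundance
count_ranks = df.rank(axis=1)
perc_ranks = perc.rank(axis=1)
print(count_ranks.equals(perc_ranks))  # → True
```

Because the percentages divide every value in a row by the same total, the within-sample rank order is always preserved by this transformation.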
We can quickly generate some x-y scatter plots to investigate relationships between the variables.
sns.pairplot(df,
x_vars = ['percBirch','percPine','percGrass'],
y_vars = ['percBirch','percPine','percGrass'],
height=2, aspect=1)

There appear to be some correlations between the variables. Are they real? Create these scatter plots also for the initial count data.
Solution
sns.pairplot(df,
x_vars = ['Birch','Pine','Grass'],
y_vars = ['Birch','Pine','Grass'],
height=2, aspect=1)

In a real situation, where independent concentrations are not available, it would be difficult to answer the question of whether the apparent correlations in the percentage data are real. In this example, however, we see that even when there are no correlations in the counts, the percentage representation of the changes in abundance introduces apparent relationships between the variables. Due to the closed sum, the decline of one variable must be met by an increase in another variable even if the absolute quantity of that variable did not change (e.g. the decline of Pine is mirrored by rising Birch percentages in samples 0 to 4, although the Birch counts did not change).
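The same point can be made quantitatively by comparing the correlation matrices of the counts and of the percentages (a sketch assuming the example data from above):

```python
import pandas as pd

# Rebuild the example data
pollen = {'Birch': [10, 10, 10, 10, 10, 30],
          'Pine':  [25, 20, 15, 10, 5, 1],
          'Grass': [5, 20, 5, 5, 5, 5]}
df = pd.DataFrame(pollen)
perc = 100 * df.div(df.sum(axis=1), axis=0)

# Pearson correlations between the variables: note how closure
# shifts the Birch-Pine correlation in the percentage data
print(df.corr().round(2))    # counts
print(perc.corr().round(2))  # percentages
```

In this data set the Birch–Pine correlation is more strongly negative in the percentages than in the counts, which is exactly the closure artefact discussed above.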
These spurious correlations and a number of other effects are due to the “closure” of the data. Closed data, or closure, means that all observations sum to a constant ‘κ’ and each individual proportion falls between 0 and ‘κ’. The individual proportions are therefore generally not independent of one another, which strictly requires special transformations. Although the closure effect also has consequences for multivariate analysis, we will not take measures against it here, but we need to be aware of it.
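The closure itself is easy to demonstrate (a sketch with the example data; here κ = 100 because we multiplied the proportions by 100):

```python
import pandas as pd

pollen = {'Birch': [10, 10, 10, 10, 10, 30],
          'Pine':  [25, 20, 15, 10, 5, 1],
          'Grass': [5, 20, 5, 5, 5, 5]}
df = pd.DataFrame(pollen)
perc = 100 * df.div(df.sum(axis=1), axis=0)

# Every sample now sums to the constant κ = 100 ...
row_sums = perc.sum(axis=1)
print(row_sums)

# ... and each individual proportion is bounded by 0 and κ
print(((perc >= 0) & (perc <= 100)).all().all())  # → True
```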
Assessing differences between multivariate samples
==================================================
We stay with the example data we created above for a moment longer and will quantify the differences between the samples. As each sample is composed of three variables we cannot compare them directly but need a distance measure. First, however, we will create a clean data set that only contains the percentage data (and not the original counts) and use this for our analysis:
pol_perc = df[['percBirch','percPine','percGrass']]

Do you remember the equation for the Euclidean distance? It is the square root of the sum of the squared differences between the pairs of variables in the two samples to be compared:

d(x, y) = sqrt( (x1-y1)^2 + (x2-y2)^2 + (x3-y3)^2 )

Please calculate the Euclidean distance between sample 0 and sample 1 in Python in small steps and provide the code.
HINT: You may want to use the pandas.DataFrame.diff function; the answer should be 35.88175.
Solution
There are surely different ways of doing this. I choose to first calculate the percentage differences of the variables between consecutive samples. Using the index, I extract only the difference between sample 0 and sample 1 into a variable that I call d_12. In a second step I square the numbers. Finally, I sum the squares and take the square root of the sum in one step.
d_12 = pol_perc.diff(periods=1, axis=0)[1:2]
d_12 = d_12**2
d_12.sum(axis=1)**0.5

So hopefully you just calculated the distance between two samples. To get the full picture you would need to repeat that calculation for all pairs of samples, which would be a lot of work, and of course there is a function in Python that can do that for you.
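Before using it, here is a one-line cross-check of the manual result with NumPy (a sketch that rebuilds pol_perc; the norm of the difference vector is exactly the Euclidean distance):

```python
import numpy as np
import pandas as pd

# Rebuild the percentage data from above
pollen = {'Birch': [10, 10, 10, 10, 10, 30],
          'Pine':  [25, 20, 15, 10, 5, 1],
          'Grass': [5, 20, 5, 5, 5, 5]}
df = pd.DataFrame(pollen)
for col in ['Birch', 'Pine', 'Grass']:
    df['perc' + col] = 100 * df[col] / df[['Birch', 'Pine', 'Grass']].sum(axis=1)
pol_perc = df[['percBirch', 'percPine', 'percGrass']]

# Euclidean distance between sample 0 and sample 1
dist = np.linalg.norm(pol_perc.iloc[0] - pol_perc.iloc[1])
print(round(dist, 5))  # → 35.88175
```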
import scipy.spatial.distance as distance
d = distance.pdist(pol_perc,'euclidean')
d = distance.squareform(d)
d

Explore the resulting distance matrix and find which two samples are most similar and which sample is most different from the other samples.
Solution
The third (2) and the fourth (3) samples are most similar; the largest distance in the matrix is between the first (0) and the last (5) sample, and sample 5 stands out as the most different from all the other samples.
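The same answer can be read off programmatically (a sketch assuming the percentage data from above): mask the zero diagonal and look up the closest and the most distant pair of samples.

```python
import numpy as np
import pandas as pd
import scipy.spatial.distance as distance

# Rebuild the percentage data from above
pollen = {'Birch': [10, 10, 10, 10, 10, 30],
          'Pine':  [25, 20, 15, 10, 5, 1],
          'Grass': [5, 20, 5, 5, 5, 5]}
df = pd.DataFrame(pollen)
perc = 100 * df.div(df.sum(axis=1), axis=0)

D = distance.squareform(distance.pdist(perc, 'euclidean'))
np.fill_diagonal(D, np.nan)  # ignore the zero self-distances
closest = np.unravel_index(np.nanargmin(D), D.shape)
farthest = np.unravel_index(np.nanargmax(D), D.shape)
print(closest, farthest)  # → (2, 3) (0, 5)
```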
Explore other available distances in scipy.spatial.distance and list at least 3 that find a largest distance for a different pair of samples compared to the euclidean distance. Find at least 3 distances that are not informative in the current case and state why.
Solution
Among others, the braycurtis, canberra and cityblock distances find the largest difference between the second (1) and the last (5) sample. Distances that evaluate presence/absence or Boolean data are not informative in our example, as our observations consist of abundance changes. Among others, hamming, jaccard and matching are not useful here.
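To compare metrics quickly, a small loop (a sketch with the same percentage data) can report which pair of samples each metric finds most distant:

```python
import numpy as np
import pandas as pd
import scipy.spatial.distance as distance

# Rebuild the percentage data from above
pollen = {'Birch': [10, 10, 10, 10, 10, 30],
          'Pine':  [25, 20, 15, 10, 5, 1],
          'Grass': [5, 20, 5, 5, 5, 5]}
df = pd.DataFrame(pollen)
perc = 100 * df.div(df.sum(axis=1), axis=0)

results = {}
for metric in ['euclidean', 'braycurtis', 'canberra', 'cityblock']:
    D = distance.squareform(distance.pdist(perc, metric))
    # argmax on the symmetric matrix returns the upper-triangle entry first
    pair = np.unravel_index(np.argmax(D), D.shape)
    results[metric] = tuple(int(i) for i in pair)
    print(metric, results[metric])
```

With this data the euclidean metric singles out samples 0 and 5, while the other three metrics agree on samples 1 and 5.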
You have familiarized yourself with the different distance measures. Let’s come back to the Euclidean distance and work out the linkage. Open the distance matrix again and draw a dendrogram by hand with the linkage distances on the y-axis. Provide the order of samples that you linked.
Solution
As seen before, the samples with the IDs 2 and 3 are linked first, followed by 0, 4, 1, and 5.
Now compute the linkages and look at the results:
from scipy.cluster.hierarchy import linkage
Z = linkage(pol_perc,'single')
Z

Look at the result! Columns one and two tell us which two samples were linked, and column three gives the distance for that linkage. In the case of single linkage these distances are usually the same as those in the distance matrix that you looked at before. We know that samples 2 and 3 are most similar and see that they were linked at their distance. The combined sample is given the next available new label (6). The next link is made between this new group (6) and sample 0, at the distance between sample 0 and sample 2.
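For readability, the linkage table can be wrapped in a labelled DataFrame (a sketch that rebuilds Z from the percentage data; the column names are my own labels for scipy's documented column meanings):

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage

# Rebuild the percentage data and the single-linkage result from above
pollen = {'Birch': [10, 10, 10, 10, 10, 30],
          'Pine':  [25, 20, 15, 10, 5, 1],
          'Grass': [5, 20, 5, 5, 5, 5]}
df = pd.DataFrame(pollen)
perc = 100 * df.div(df.sum(axis=1), axis=0)

Z = linkage(perc, 'single')
# cluster_1 / cluster_2: the two linked labels, distance: linkage distance,
# n_samples: number of original samples in the new cluster
Z_table = pd.DataFrame(Z, columns=['cluster_1', 'cluster_2', 'distance', 'n_samples'])
print(Z_table)
```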
You can now, of course, also plot the dendrogram and check if your drawing was OK.
from scipy.cluster.hierarchy import dendrogram
plt.figure(figsize=(8, 6))
dendrogram(Z)
plt.xlabel('Sample ID')
plt.ylabel('Distance')
plt.show()

Now compute and plot a dendrogram using complete linkage and describe the differences. What is indicated by the 4th column of the linkage table Z? Why are the distances (y-axis) different compared to the results obtained with single linkage?
Solution
Complete linkage uses the largest distance from any sample in one group to any sample in the other group. The two samples that are most similar (2 and 3) are grouped in the same way; now, however, the distance between samples 4 and 5 is smaller than the largest distance from the emerging cluster to sample 5, so 4 and 5 form their own cluster. As the algorithm always looks at the largest distance, the linking distances on the y-axis are much larger than with single linkage. The 4th column of the linkage table Z indicates the number of samples contained in the newly formed cluster.
Z = linkage(pol_perc, 'complete')
Z
plt.figure(figsize=(8, 6))
dendrogram(Z)
plt.xlabel('Sample ID')
plt.ylabel('Distance')
plt.show()
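As a possible next step (a sketch with the same example data), the complete-linkage tree can be cut into a fixed number of groups with scipy's fcluster:

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Rebuild the percentage data and the complete-linkage result
pollen = {'Birch': [10, 10, 10, 10, 10, 30],
          'Pine':  [25, 20, 15, 10, 5, 1],
          'Grass': [5, 20, 5, 5, 5, 5]}
df = pd.DataFrame(pollen)
perc = 100 * df.div(df.sum(axis=1), axis=0)

Z = linkage(perc, 'complete')
# cut the tree so that exactly two clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # samples 4 and 5 end up in their own cluster
```

Cutting at two clusters separates samples 4 and 5 from the rest, matching the dendrogram discussed above.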