import pandas as pd
poll_surf = pd.read_csv("../Data/Pol_surf.csv")
sites = pd.read_csv("../Data/Pol_site.csv")
Cluster Analysis
Today we will work with real-world data, exploring cluster analysis of pollen data. If you don't really know what pollen is, have a look at this video or this TedTalk. Pollen grains released by plants mix in the air and are deposited in lakes and bogs, where they are preserved. Thus, analysing the pollen content of one sample is like a remote sensing tool for vegetation cover that works for thousands of years into the past. Modern datasets are needed to study the pollen content of samples in relation to environmental factors like climate. The data that we want to look at is split into two files, one containing the percentages of the pollen counts (Pol_surf.csv) and the other information on the site locations as well as some climate parameters (Pol_site.csv). Read both datasets into Spyder and have a look at them.
What kind of data do we have in the poll_surf set? Calculate the row sums to investigate.
Solution
To calculate row sums you can use the method .sum(), where you need to specify the axis over which to sum. The default axis 0 gives column sums, so you need to set axis=1 for row sums.
poll_surf.sum(axis=1)
You see that the resulting sums are 100 or very close to 100. This means we are looking at percentage data. From seeing some values below 99 we can guess that individual variables (pollen taxa) were removed from some samples without recalculating the percentages.
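The samples whose percentages no longer sum to 100 can be located with a boolean filter. A minimal sketch on synthetic data (with the real data you would filter poll_surf the same way):

```python
import pandas as pd

# Synthetic stand-in for poll_surf: three samples, three taxa (percent data)
demo = pd.DataFrame({
    "Pinus":   [60.0, 55.0, 70.0],
    "Betula":  [30.0, 40.0, 25.0],
    "Poaceae": [10.0,  4.0,  5.0],
})

row_sums = demo.sum(axis=1)
# Samples whose percentages deviate from 100 by more than 0.5
off = row_sums[(row_sums - 100).abs() > 0.5]
print(off)
```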
To get an overview of the data coverage, it is best to plot it on a map. For now, we don't really need all the climate data; we would only like to retain July temperature and use it for mapping.
import geopandas
import geodatasets
import matplotlib.pyplot as plt
# Get a base map
world = geopandas.read_file(geodatasets.get_path("naturalearth.land"))
# Select variables into a new dataframe
loc = sites.loc[:, ("latitude", "longitude", "Tjuly")]
# Create point geometries
geometry = geopandas.points_from_xy(loc.longitude, loc.latitude)
geo_loc = geopandas.GeoDataFrame(
    loc[["latitude", "longitude", "Tjuly"]], geometry=geometry
)
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
geo_loc.plot(column="Tjuly", ax=ax, legend=True, categorical=False, cmap='OrRd', markersize=50)
plt.title("July Temperature")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
To get a feel for cluster analysis it is useful to restrict the dataset both in terms of taxa and in the number of samples. The pollen data contains information for one hundred taxa, many of which are rare. So, as a first step we reduce the taxa to those with an abundance of at least 2%:
poll_c = poll_surf.loc[:, (poll_surf > 2).any()]
How many taxa exceed this criterion?
Solution
There are 27 taxa (variables) in the new dataframe ‘poll_c’.
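The taxon count can be read off with .shape. A minimal sketch on synthetic data (with the real data, poll_c.shape[1] returns the 27 taxa):

```python
import pandas as pd

# Synthetic stand-in for poll_surf: four taxa, two of them rare
demo = pd.DataFrame({
    "Pinus":  [60.0, 55.0],
    "Betula": [38.5, 43.0],
    "Salix":  [1.0,  1.5],
    "Alnus":  [0.5,  0.5],
})

# Keep only taxa exceeding 2% in at least one sample
demo_c = demo.loc[:, (demo > 2).any()]
print(demo_c.shape[1])
```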
When drawing a random sample of the available sites, we need to take care to also get the matching sample locations. As the index is preserved, we can sample the pollen data straight away, then extract the index and use it to select the location data. To make sure all of you take the same sample, we set the seed.
import numpy as np
np.random.seed(2)
poll_s = poll_c.sample(50)
idx = poll_s.index.values.tolist()
g_loc_s = geo_loc.iloc[idx]
Using the above code and the July temperature, make a map to look at the samples that you extracted.
Solution
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
g_loc_s.plot(column="Tjuly", ax=ax, legend=True, cmap='OrRd', markersize=50)
plt.title("July Temperature")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
Finally, we have a dataset ready to explore. You already calculated the Euclidean distance between samples, so let us start by looking at that distance for this new dataset.
import scipy.spatial.distance as distance
d = distance.pdist(poll_s,'euclidean')
ds = distance.squareform(d)
Have a look at the resulting distance matrix. It is rather large, so it is useful to compute a heat map.
from matplotlib import cm
fig, ax = plt.subplots()
c = ax.imshow(ds, cmap=cm.Oranges)
fig.colorbar(c)  # add a colour scale for the distances
What does the matrix tell us? What do the dark colours indicate, similarity or dissimilarity? What can you say about sample 30? Find the two samples that are most similar.
Solution
This is a symmetric matrix with a diagonal of zero values, as these are the distances of each sample to itself. On the x- and y-axis we see the sample IDs. Dark colours indicate large differences (high dissimilarity) between samples. Sample number 30 generally differs from the other samples: none of its cells shows a light colour, i.e. low dissimilarity.
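The most similar pair can also be found programmatically by masking the zero diagonal before taking the argmin. A sketch on random data (with the real data you would build ds from poll_s as above):

```python
import numpy as np
import scipy.spatial.distance as distance

# Random stand-in for the pollen samples: five samples, three variables
rng = np.random.default_rng(0)
data = rng.random((5, 3)) * 100

ds = distance.squareform(distance.pdist(data, 'euclidean'))

# Put infinity on the diagonal so a sample cannot be matched with itself
masked = ds + np.diag(np.full(len(ds), np.inf))
i, j = np.unravel_index(np.argmin(masked), masked.shape)
print(i, j, ds[i, j])
```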
Agglomerative clustering
Now it is time to find the most similar samples and link them together into clusters. We start with 'single' linkage, as that is conceptually the simplest.
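Before running it on the pollen data, it helps to see what linkage returns. Each row of the output describes one merge: the two cluster IDs joined, the distance between them, and the size of the new cluster; merged clusters get new IDs counting on from the number of samples. A toy example:

```python
import numpy as np
import scipy.spatial.distance as distance
from scipy.cluster.hierarchy import linkage

# Four points on a line: two tight pairs (0, 1) and (5, 6)
pts = np.array([[0.0], [1.0], [5.0], [6.0]])
Z = linkage(distance.pdist(pts, 'euclidean'), 'single')
print(Z)
# Rows: [cluster_i, cluster_j, distance, size of merged cluster]
```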
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(d,'single')
Have a look at the first 15 rows of the output.
print(Z[:15,:])
It is of course easier to investigate these links using the dendrogram:
plt.figure(figsize=(8, 6))
dendrogram(Z)
plt.xlabel('Sample ID')
plt.ylabel('Distance')
plt.xticks(fontsize = 12)
plt.show()
Which samples are most dissimilar from the rest?
Solution
Looking at the dendrogram we find sample 30 again, which forms a cluster with samples 3 and 12. This cluster has the longest link to all other samples. Samples 10 and 45 also join the remaining samples only at large distances, followed by samples 15 and 28.
This cluster analysis results in a skewed distribution of clusters, with one large cluster, one cluster of three, and two clusters of a single sample. This is not ideal for our problem, but let us still have a look at the map to see where these clusters lie. In order to do that, we first need to cut the dendrogram into the desired number of classes and insert these into our GeoDataFrame:
from scipy import cluster
poll_class = cluster.hierarchy.cut_tree(Z, n_clusters=4)
g_loc_s.insert(1, "PollClass", poll_class)
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
g_loc_s.plot(column="PollClass", ax=ax, legend=True, categorical=True, markersize=50)
plt.title("Vegetation Classification")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
Where do you find the 5 samples that are most different from all other samples?
Solution
The 5 samples assigned to clusters 1 to 3 are all located in Norway more or less near the west coast.
In the above code there are two parameters that determine the output of your cluster analysis: i) the distance measure and ii) the linkage method. Have a look at the available distance measures in pdist as well as the different linkage options and explore different combinations. Among other combinations, try 'cityblock' with 'ward'! Make dendrograms and maps (at least 2). Look out for the previously identified cluster of the 3 most different samples. Is that cluster maintained in other classifications?
Solution
# combination 'cityblock' and 'ward'
d = distance.pdist(poll_s,'cityblock')
Z = linkage(d,'ward')
plt.figure(figsize=(8, 6))
dendrogram(Z)
plt.xlabel('Sample ID')
plt.ylabel('Distance')
plt.xticks(fontsize = 12)
plt.show()
poll_class = cluster.hierarchy.cut_tree(Z, n_clusters=4)
g_loc_s["PollClass"] = poll_class
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
g_loc_s.plot(column="PollClass", ax=ax, legend=True, categorical=True, markersize=50)
plt.title("Vegetation Classification using cityblock and ward")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
# combination 'braycurtis' and 'average'
d = distance.pdist(poll_s,'braycurtis')
Z = linkage(d,'average')
plt.figure(figsize=(8, 6))
dendrogram(Z)
plt.xlabel('Sample ID')
plt.ylabel('Distance')
plt.xticks(fontsize = 12)
plt.show()
poll_class = cluster.hierarchy.cut_tree(Z, n_clusters=4)
g_loc_s["PollClass"] = poll_class
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
g_loc_s.plot(column="PollClass", ax=ax, legend=True, categorical=True, markersize=50)
plt.title("Vegetation Classification using braycurtis and average")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
Yes, at least in the above two combinations we recover the previously identified cluster of the 3 most different samples.
The Euclidean distance is strongly influenced by absolute differences; these differences can be reduced using a square root transformation.
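The dampening effect can be seen on a toy pair of samples where one abundant taxon dominates the difference (values are illustrative):

```python
import numpy as np
import scipy.spatial.distance as distance

# Two samples differing mainly in the dominant taxon
a = np.array([[80.0, 10.0, 10.0],
              [60.0, 20.0, 20.0]])

d_raw  = distance.pdist(a, 'euclidean')[0]
d_sqrt = distance.pdist(np.sqrt(a), 'euclidean')[0]
print(d_raw, d_sqrt)  # the square root shrinks the absolute differences
```

The real transformation of the pollen data follows the same pattern.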
poll_sq = poll_s.apply(np.sqrt)
d = distance.pdist(poll_sq,'euclidean')
Z = linkage(d,'ward')
plt.figure(figsize=(8, 6))
dendrogram(Z)
plt.xlabel('Sample ID')
plt.ylabel('Distance')
plt.show()
poll_class = cluster.hierarchy.cut_tree(Z, n_clusters=4)
g_loc_s["PollClass"] = poll_class
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
g_loc_s.plot(column="PollClass", ax=ax, legend=True, categorical=True, markersize=50)
plt.title("Vegetation Classification using square root transformation")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
We learned that the Chord distance is a Euclidean distance computed on normalized samples. Thus, although this distance is not provided directly, it can be computed:
from sklearn.preprocessing import normalize
poll_st = normalize(poll_s, axis=1)
d = distance.pdist(poll_st,'euclidean')
Compute a dendrogram and map using the Chord distance and 'ward' linkage. Are there strong differences to the square root transformation?
Solution
Z = linkage(d,'ward')
plt.figure(figsize=(8, 6))
dendrogram(Z)
plt.xlabel('Sample ID')
plt.ylabel('Distance')
plt.show()
poll_class = cluster.hierarchy.cut_tree(Z, n_clusters=4)
g_loc_s["PollClass"] = poll_class
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
g_loc_s.plot(column="PollClass", ax=ax, legend=True, categorical=True, markersize=50)
plt.title("Vegetation Classification using Chord distance")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
The differences are small. Relative differences in the dendrogram are more pronounced, and 5 samples fall into a different cluster.
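That the construction above really yields the Chord distance can be checked on a toy pair: the Euclidean distance between L2-normalized rows equals the distance between the unit-length vectors computed by hand:

```python
import numpy as np
import scipy.spatial.distance as distance
from sklearn.preprocessing import normalize

# Two samples as percentage vectors
x = np.array([[60.0, 30.0, 10.0],
              [20.0, 50.0, 30.0]])

# Chord distance: Euclidean distance after row-wise L2 normalization
chord = distance.pdist(normalize(x, axis=1), 'euclidean')[0]

# The same computed by hand
u = x[0] / np.linalg.norm(x[0])
v = x[1] / np.linalg.norm(x[1])
print(np.isclose(chord, np.linalg.norm(u - v)))
```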
K-Means Clustering
Once you have exhausted all the combinations of distance measures and linkages, it is time to try K-Means clustering. K-Means is computed using the Euclidean distance, which, as we discussed above, is sensitive to absolute differences between variables; it is therefore recommended to standardize the data prior to running K-Means. Above we applied a sample-wise normalization, which we can reuse.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4, random_state=2, n_init=20).fit(poll_st)
Setting a random seed using the random_state argument ensures that the initial cluster assignments can be replicated and the K-Means output will be fully reproducible. The n_init argument was set to run K-Means with 20 initial cluster assignments (the default is 10). James et al. (2023) recommend always running K-Means clustering with a large value of n_init, such as 20 or 50, to avoid obtaining a local optimum.
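The reproducibility claim is easy to verify on toy data: two fits with the same random_state give identical labels and inertia (all data and names here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well separated blobs of 20 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, size=(20, 2)),
               rng.normal(5, 0.1, size=(20, 2))])

km1 = KMeans(n_clusters=2, random_state=2, n_init=20).fit(X)
km2 = KMeans(n_clusters=2, random_state=2, n_init=20).fit(X)
print(np.array_equal(km1.labels_, km2.labels_))  # same seed -> same result
```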
Class labels are obtained with kmeans.labels_. Use these to overwrite the PollClass column in the 'g_loc_s' dataframe and look at the classification on the map using the same code as before.
Solution
g_loc_s["PollClass"] = kmeans.labels_
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
g_loc_s.plot(column="PollClass", ax=ax, legend=True, categorical=True, markersize=50)
plt.title("Vegetation Classification with KMeans")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
Here you don't produce a dendrogram but obtain an optimal cluster configuration for a given number of classes. Thus the question of how many classes to compute becomes more relevant. One way to evaluate this is using a scree plot.
sse = {}
for k in range(1, 15):
    kmeans = KMeans(n_clusters=k, random_state=2, n_init=20).fit(poll_st)
    sse[k] = kmeans.inertia_
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()
The position of the 'elbow', where the curve changes from a steep to a more gentle decline, indicates the number of clusters that should be used. So in our case 4 seems like a good choice.
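The elbow can also be located numerically, for example by finding where the drop in SSE between successive k suddenly shrinks. A rough heuristic sketch on an illustrative (not the real) SSE curve:

```python
import numpy as np

# Illustrative SSE curve with an elbow at k = 4
sse = {1: 100.0, 2: 60.0, 3: 35.0, 4: 20.0, 5: 17.0, 6: 15.0, 7: 13.5}

ks = np.array(list(sse.keys()))
vals = np.array(list(sse.values()))
drops = -np.diff(vals)  # SSE reduction gained by adding one more cluster
# Elbow: the k after which the drop collapses most strongly
elbow = ks[1:][np.argmax(drops[:-1] / drops[1:])]
print(elbow)
```

This is only a heuristic; visual inspection of the scree plot remains the usual check.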
Will the number of recommended clusters be different if we work with the full dataset? How does the K-Means classification of the full data compare to the one obtained with Chord distance and Ward linkage? Which one gives better spatial clusters?
Solution
poll_st = normalize(poll_surf, axis=1)
# KMeans
kmeans = KMeans(n_clusters=4, random_state=2, n_init=20).fit(poll_st)
geo_loc.insert(1, "PollClass", kmeans.labels_)
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
geo_loc.plot(column="PollClass", ax=ax, legend=True, categorical=True, markersize=50)
plt.title("Vegetation Classification with KMeans")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
sse = {}
for k in range(1, 15):
    kmeans = KMeans(n_clusters=k, random_state=2, n_init=20).fit(poll_st)
    sse[k] = kmeans.inertia_
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()
# Hierarchical clustering
d = distance.pdist(poll_st,'euclidean')
Z = linkage(d,'ward')
poll_class = cluster.hierarchy.cut_tree(Z, n_clusters=4)
geo_loc["PollClass"] = poll_class
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
geo_loc.plot(column="PollClass", ax=ax, legend=True, categorical=True, markersize=50)
plt.title("Vegetation Classification Hierarchical clustering")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
The scree plot looks very similar for the reduced and the full dataset. The K-Means classification of the full data and the hierarchical clustering yield similar results, with some marked differences. Both separate a group of northern sites relatively well and recognize a coastal group for Norway, but neither resolves the border between the hemiboreal and boreal forests well (e.g. Estonia and southern Finland are in the hemiboreal zone). Neither gives good spatial clusters.
Please explore the full data with all the insights you learned above. Consider applying cluster algorithms after data transformation or different ways of standardization. Decide on the classification you like best and motivate your choice.
Solution
I like the result of Ward linkage using the Euclidean distance after square-root transformation. The resulting dendrogram indicates 3 strong clusters, which are also visible on the map. There is a separation within the coastal cluster which results in some coastal sites being split off. A separation of the largest cluster is only achieved when the tree is cut into 5 clusters. Here the above-mentioned separation into hemiboreal and boreal forests is somewhat visible, and this may be reasonable. However, it is hard to evaluate, and thus for a large-scale view I would rather stay with 3 well separated clusters that I call: 0) boreal and hemiboreal forests; 1) northern and high-elevation sites; 2) coastal and western sites.
poll_st = poll_surf.apply(np.sqrt)
d = distance.pdist(poll_st,'euclidean')
Z = linkage(d,'ward')
plt.figure(figsize=(8, 6))
dendrogram(Z)
plt.xlabel('Sample ID')
plt.ylabel('Distance')
plt.show()
poll_class = cluster.hierarchy.cut_tree(Z, n_clusters=3)
geo_loc["PollClass"] = poll_class
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
geo_loc.plot(column="PollClass", ax=ax, legend=True, categorical=True, markersize=50)
plt.title("Vegetation Classification Hierarchical clustering")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
Mixing in nominal variables
As you have noticed, the data we are working with contains not just pollen percentages; there is also environmental information in the table holding the geographic coordinates. I have added one nominal variable to that table, a label for whether or not sites are near the coast. So let us collect some meaningful variables from that table and look at them.
env = sites.loc[:, ("Tjuly", "Tjanuary", "Precipitation", "Range", "tmp_tetra", "elev", "is_coastal")]
We have January and July temperatures as well as annual precipitation, interpolated from the nearest weather station, and the temperature difference 'Range'. The variables tmp_tetra (average warm-season temperature) and elev (elevation) were extracted by matching the sites to a climate grid; thus elevation is likely underestimated. Nevertheless, we would like to add another categorical variable to describe the highly different vegetation structure below and above the tree line. The tree line in Scandinavia declines from around 1000 m in the south to around 100 m in the north. We focus on the south and, considering the generalization of the mountain relief due to gridding, set the tree line at 700 m. We can create the new variable as follows:
t_tree = env['elev'] > 0.7  # 'elev' appears to be given in km, so 0.7 corresponds to the 700 m tree line
env['is_a_tree'] = np.multiply(t_tree, 1)  # convert boolean to 0/1 (same as t_tree.astype(int))
The variables we have now assembled in the table all have different units. Temperature and precipitation are already different quantities, and with the two categorical variables we are mixing completely different value ranges, so absolute differences between, e.g., temperature and precipitation have no meaning. Thus it is time to standardize the data. One way to do that is using the function in scikit-learn:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
env_s = scaler.fit_transform(env)
How does that function standardize the data by default? Using the same default, calculate the standardization for the column 'is_a_tree' in Python and provide the code. What is the standardized value of a '0' and a '1'? Are these numbers exactly the same as the output of StandardScaler()?
Solution
Answer: Standardization is done column-wise, subtracting the mean and dividing by the standard deviation.
mean = env['is_a_tree'].mean()
(env['is_a_tree'] - mean) / env['is_a_tree'].std()
The '0' value becomes -0.291372 and '1' is scaled to 3.423624. These values are slightly different from the output of StandardScaler(), as the std function in pandas uses N-1 by default; setting ddof=0 gives exactly the same values.
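The N versus N-1 point can be checked directly; pandas with ddof=0 reproduces StandardScaler exactly. A small sketch:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

x = pd.Series([0.0, 0.0, 0.0, 1.0, 1.0])

z_pandas = (x - x.mean()) / x.std()        # sample std, N-1 (pandas default)
z_ddof0  = (x - x.mean()) / x.std(ddof=0)  # population std, N
z_scaler = StandardScaler().fit_transform(x.to_frame()).ravel()

print(np.allclose(z_ddof0, z_scaler))  # ddof=0 matches StandardScaler
```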
Adapt the above code and make a cluster analysis of the data with environmental descriptors. 1) How many clusters would you use? Map these clusters as before and interpret the results: 2) Which environmental gradient gave rise to the separation of the two largest clusters? 3) How much did the addition of the two categorical variables influence the clustering? Look into the output or run additional analyses to find all answers.
Solution
Answers:
d = distance.pdist(env_s,'euclidean')
Z = linkage(d,'ward')
plt.figure(figsize=(8, 6))
dendrogram(Z)
plt.xlabel('Sample ID')
plt.ylabel('Distance')
plt.show()
env_class = cluster.hierarchy.cut_tree(Z, n_clusters=4)
geo_loc["EnvClass"] = env_class
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
geo_loc.plot(column="EnvClass", ax=ax, legend=True, categorical=True, markersize=50)
plt.title("Environmental Classification")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
- I would use 4 clusters as they are well separated.
- From just eyeballing the results it seems that summer warmth is separating the samples into a northern and a southern cluster.
- The addition of the categorical variables influenced the results substantially. Running the analysis without these two variables results in 3 distinct clusters, separating coastal sites and combining southern mountain sites with northern sites. A division into 4 classes would separate the mountain sites in a similar, but not identical, way to our assignment. However, the coastal sites did not form a cluster of exactly the sites flagged as 'is_coastal'. You can also compare the number of samples per flag value with the number of samples per class:
unique_values, counts = np.unique(env_class, return_counts=True)
for value, count in zip(unique_values, counts):
    print(f"{value} occurs {count} times")
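To see how well one cluster matches the 'is_coastal' flag, a cross-tabulation is more direct than comparing counts. A sketch on toy labels (with the real data you would pass env['is_coastal'] and the flattened cluster labels):

```python
import pandas as pd

# Toy stand-in: coastal flag and cluster labels for eight sites
is_coastal = pd.Series([1, 1, 0, 0, 0, 1, 0, 0], name="is_coastal")
labels = pd.Series([0, 0, 1, 1, 2, 0, 2, 1], name="cluster")

# Rows: flag value; columns: cluster; cells: number of sites
ct = pd.crosstab(is_coastal, labels)
print(ct)
```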