import pandas as pd
poll_surf = pd.read_csv("../Data/Pol_surf.csv")
sites = pd.read_csv("../Data/Pol_site.csv")
Cluster Analysis
Today we will work with real-world data, exploring cluster analysis of pollen data. If you don't really know what pollen is, have a look at this video or this TedTalk. Pollen grains released by plants mix in the air and are deposited in lakes and bogs, where they are preserved. Thus, analysing the pollen content of one sample is like a remote sensing tool for vegetation cover that works for thousands of years into the past. Modern datasets are needed to study the pollen content of samples in relation to environmental factors like climate. The data that we want to look at is split into two files, one containing the percentages of the pollen counts (Pol_surf.csv) and the other information on the site locations as well as some climate parameters (Pol_site.csv). Read both datasets into Spyder and have a look at them.
What kind of data do we have in the poll_surf set? Calculate the row sums to investigate.
Solution
To calculate row sums you can use the method .sum(), where you need to specify the axis over which to sum. The default axis 0 gives column sums, so you need to set axis=1 for row sums.
poll_surf.sum(axis=1)
You see that the resulting sums are 100 or very close to 100. This means we are looking at percentage data. From seeing some values below 99 we can guess that individual variables (pollen taxa) were removed from some samples without recalculating the percentages.
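The samples whose percentages no longer sum to 100 can be located with a boolean filter. A minimal sketch on synthetic data (with the real data you would filter poll_surf the same way):

```python
import pandas as pd

# Synthetic stand-in for poll_surf: three samples, three taxa (percent data)
demo = pd.DataFrame({
    "Pinus":   [60.0, 55.0, 70.0],
    "Betula":  [30.0, 40.0, 25.0],
    "Poaceae": [10.0,  4.0,  5.0],
})

row_sums = demo.sum(axis=1)
# Samples whose percentages deviate from 100 by more than 0.5
off = row_sums[(row_sums - 100).abs() > 0.5]
print(off)
```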
To get an overview of the data coverage, it is best to plot it on a map. For now, we don't really need all the climate data; we would only like to retain July temperature and use it for mapping.
import geopandas
import geodatasets
import matplotlib.pyplot as plt
# Get a base map
world = geopandas.read_file(geodatasets.get_path("naturalearth.land"))
# Select variables into a new dataframe
loc = sites.loc[:, ("latitude", "longitude", "Tjuly")]
# Create point geometries
geometry = geopandas.points_from_xy(loc.longitude, loc.latitude)
geo_loc = geopandas.GeoDataFrame(
    loc[["latitude", "longitude", "Tjuly"]], geometry=geometry
)
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
geo_loc.plot(column="Tjuly", ax=ax, legend=True, categorical=False, cmap='OrRd', markersize=50)
plt.title("July Temperature")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
To get a feel for cluster analysis it is useful to restrict the dataset both in terms of taxa and in the number of samples. The pollen data contains information for one hundred taxa, many of which are rare. So, as a first step we reduce the taxa to those with an abundance of at least 2%:
poll_c = poll_surf.loc[:, (poll_surf > 2).any()]
How many taxa exceed this criterion?
Solution
There are 27 taxa (variables) in the new dataframe ‘poll_c’.
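The taxon count can be read off with .shape. A minimal sketch on synthetic data (with the real data, poll_c.shape[1] returns the 27 taxa):

```python
import pandas as pd

# Synthetic stand-in for poll_surf: four taxa, two of them rare
demo = pd.DataFrame({
    "Pinus":  [60.0, 55.0],
    "Betula": [38.5, 43.0],
    "Salix":  [1.0,  1.5],
    "Alnus":  [0.5,  0.5],
})

# Keep only taxa exceeding 2% in at least one sample
demo_c = demo.loc[:, (demo > 2).any()]
print(demo_c.shape[1])
```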
When drawing a random sample of the available sites, we need to take care to also get the matching sample locations. As the index is preserved, we can sample the pollen data straight away, then extract the index and use it to select the location data. To make sure all of you take the same sample, we set the seed.
import numpy as np
np.random.seed(2)
poll_s = poll_c.sample(50)
idx = poll_s.index.values.tolist()
g_loc_s = geo_loc.iloc[idx]
Using the above code and the July temperature, make a map to look at the samples that you extracted.
Solution
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
g_loc_s.plot(column="Tjuly", ax=ax, legend=True, cmap='OrRd', markersize=50)
plt.title("July Temperature")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
Finally, we have a dataset ready to explore. You already calculated the Euclidean distance between samples, so let us start by looking at that distance for this new dataset.
import scipy.spatial.distance as distance
d = distance.pdist(poll_s,'euclidean')
ds = distance.squareform(d)
Have a look at the resulting distance matrix. It is rather large, so it is useful to compute a heat map.
from matplotlib import cm
fig, ax = plt.subplots()
c = ax.imshow(ds, cmap=cm.Oranges)
fig.colorbar(c)  # add a colour scale for the distances
What does the matrix tell us? What do the dark colours indicate, similarity or dissimilarity? What can you say about sample 30? Find the two samples that are most similar.
Solution
This is a symmetric matrix with a diagonal of zero values, as these are the distances of each sample to itself. On the x- and y-axis we see the sample IDs. Dark colours indicate large differences (high dissimilarity) between samples. Sample number 30 generally differs from the other samples: none of its cells shows a light colour, i.e. low dissimilarity.
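The most similar pair can also be found programmatically by masking the zero diagonal before taking the argmin. A sketch on random data (with the real data you would build ds from poll_s as above):

```python
import numpy as np
import scipy.spatial.distance as distance

# Random stand-in for the pollen samples: five samples, three variables
rng = np.random.default_rng(0)
data = rng.random((5, 3)) * 100

ds = distance.squareform(distance.pdist(data, 'euclidean'))

# Put infinity on the diagonal so a sample cannot be matched with itself
masked = ds + np.diag(np.full(len(ds), np.inf))
i, j = np.unravel_index(np.argmin(masked), masked.shape)
print(i, j, ds[i, j])
```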
Agglomerative clustering
Now it is time to find the most similar samples and link them together into clusters. We start with 'single' linkage, as that is conceptually the simplest.
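Before running it on the pollen data, it helps to see what linkage returns. Each row of the output describes one merge: the two cluster IDs joined, the distance between them, and the size of the new cluster; merged clusters get new IDs counting on from the number of samples. A toy example:

```python
import numpy as np
import scipy.spatial.distance as distance
from scipy.cluster.hierarchy import linkage

# Four points on a line: two tight pairs (0, 1) and (5, 6)
pts = np.array([[0.0], [1.0], [5.0], [6.0]])
Z = linkage(distance.pdist(pts, 'euclidean'), 'single')
print(Z)
# Rows: [cluster_i, cluster_j, distance, size of merged cluster]
```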
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(d,'single')
Have a look at the first 15 rows of the output.
print(Z[:15,:])
It is of course easier to investigate these links using the dendrogram:
plt.figure(figsize=(8, 6))
dendrogram(Z)
plt.xlabel('Sample ID')
plt.ylabel('Distance')
plt.xticks(fontsize = 12)
plt.show()
Which samples are most dissimilar from the rest?
Solution
Looking at the dendrogram we find sample 30 again, which forms a cluster with samples 3 and 12. This cluster has the longest link to all other samples. Samples 10 and 45 also join the remaining samples only at large distances, followed by samples 15 and 28.
This cluster analysis results in a skewed distribution of clusters, with one large cluster, one cluster of three, and two clusters of a single sample. This is not ideal for our problem, but let us still have a look at the map to see where these clusters lie. In order to do that, we first need to cut the dendrogram into the desired number of classes and insert these into our GeoDataFrame:
from scipy import cluster
poll_class = cluster.hierarchy.cut_tree(Z, n_clusters=4)
g_loc_s.insert(1, "PollClass", poll_class)
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
g_loc_s.plot(column="PollClass", ax=ax, legend=True, categorical=True, markersize=50)
plt.title("Vegetation Classification")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
Where do you find the 5 samples that are most different from all other samples?
Solution
The 5 samples assigned to clusters 1 to 3 are all located in Norway more or less near the west coast.
In the above code there are two parameters that determine the output of your cluster analysis: i) the distance measure and ii) the linkage method. Have a look at the available distance measures in pdist as well as the different linkage options and explore different combinations. Among other combinations, try 'cityblock' with 'ward'! Make dendrograms and maps (at least 2). Look out for the previously identified cluster of the 3 most different samples. Is that cluster maintained in other classifications?
Solution
# combination 'cityblock' and 'ward'
d = distance.pdist(poll_s,'cityblock')
Z = linkage(d,'ward')
plt.figure(figsize=(8, 6))
dendrogram(Z)
plt.xlabel('Sample ID')
plt.ylabel('Distance')
plt.xticks(fontsize = 12)
plt.show()
poll_class = cluster.hierarchy.cut_tree(Z, n_clusters=4)
g_loc_s["PollClass"] = poll_class
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
g_loc_s.plot(column="PollClass", ax=ax, legend=True, categorical=True, markersize=50)
plt.title("Vegetation Classification using cityblock and ward")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
# combination 'braycurtis' and 'average'
d = distance.pdist(poll_s,'braycurtis')
Z = linkage(d,'average')
plt.figure(figsize=(8, 6))
dendrogram(Z)
plt.xlabel('Sample ID')
plt.ylabel('Distance')
plt.xticks(fontsize = 12)
plt.show()
poll_class = cluster.hierarchy.cut_tree(Z, n_clusters=4)
g_loc_s["PollClass"] = poll_class
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
g_loc_s.plot(column="PollClass", ax=ax, legend=True, categorical=True, markersize=50)
plt.title("Vegetation Classification using braycurtis and average")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
Yes, at least in the above two combinations we recover the previously identified cluster of the 3 most different samples.
The Euclidean distance is strongly influenced by absolute differences; these differences can be reduced using a square root transformation.
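The dampening effect can be seen on a toy pair of samples where one abundant taxon dominates the difference (values are illustrative):

```python
import numpy as np
import scipy.spatial.distance as distance

# Two samples differing mainly in the dominant taxon
a = np.array([[80.0, 10.0, 10.0],
              [60.0, 20.0, 20.0]])

d_raw  = distance.pdist(a, 'euclidean')[0]
d_sqrt = distance.pdist(np.sqrt(a), 'euclidean')[0]
print(d_raw, d_sqrt)  # the square root shrinks the absolute differences
```

The real transformation of the pollen data follows the same pattern.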
poll_sq = poll_s.apply(np.sqrt)
d = distance.pdist(poll_sq,'euclidean')
Z = linkage(d,'ward')
plt.figure(figsize=(8, 6))
dendrogram(Z)
plt.xlabel('Sample ID')
plt.ylabel('Distance')
plt.show()
poll_class = cluster.hierarchy.cut_tree(Z, n_clusters=4)
g_loc_s["PollClass"] = poll_class
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
g_loc_s.plot(column="PollClass", ax=ax, legend=True, categorical=True, markersize=50)
plt.title("Vegetation Classification using square root transformation")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
We learned that the Chord distance is a Euclidean distance computed on normalized samples. Thus, although this distance is not provided directly, it can be computed:
from sklearn.preprocessing import normalize
poll_st = normalize(poll_s, axis=1)
d = distance.pdist(poll_st,'euclidean')
Compute a dendrogram and map using the Chord distance and 'ward' linkage. Are there strong differences to the square root transformation?
Solution
Z = linkage(d,'ward')
plt.figure(figsize=(8, 6))
dendrogram(Z)
plt.xlabel('Sample ID')
plt.ylabel('Distance')
plt.show()
poll_class = cluster.hierarchy.cut_tree(Z, n_clusters=4)
g_loc_s["PollClass"] = poll_class
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
g_loc_s.plot(column="PollClass", ax=ax, legend=True, categorical=True, markersize=50)
plt.title("Vegetation Classification using Chord distance")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
The differences are small. Relative differences in the dendrogram are more pronounced, and 5 samples fall into a different cluster.
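That the construction above really yields the Chord distance can be checked on a toy pair: the Euclidean distance between L2-normalized rows equals the distance between the unit-length vectors computed by hand:

```python
import numpy as np
import scipy.spatial.distance as distance
from sklearn.preprocessing import normalize

# Two samples as percentage vectors
x = np.array([[60.0, 30.0, 10.0],
              [20.0, 50.0, 30.0]])

# Chord distance: Euclidean distance after row-wise L2 normalization
chord = distance.pdist(normalize(x, axis=1), 'euclidean')[0]

# The same computed by hand
u = x[0] / np.linalg.norm(x[0])
v = x[1] / np.linalg.norm(x[1])
print(np.isclose(chord, np.linalg.norm(u - v)))
```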
K-Means Clustering
Once you have exhausted all the combinations of distance measures and linkages, it is time to try K-Means clustering. K-Means is computed using the Euclidean distance, which, as we discussed above, is sensitive to absolute differences between variables; it is therefore recommended to standardize the data prior to running K-Means. Above we applied a sample-wise normalization, which we can reuse.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4, random_state=2, n_init=20).fit(poll_st)
Setting a random seed using the random_state argument ensures that the initial cluster assignments can be replicated and the K-Means output will be fully reproducible. The n_init argument was set to run K-Means with 20 initial cluster assignments (the default is 10). James et al. (2023) recommend always running K-Means clustering with a large value of n_init, such as 20 or 50, to avoid obtaining a local optimum.
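The reproducibility claim is easy to verify on toy data: two fits with the same random_state give identical labels and inertia (all data and names here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well separated blobs of 20 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, size=(20, 2)),
               rng.normal(5, 0.1, size=(20, 2))])

km1 = KMeans(n_clusters=2, random_state=2, n_init=20).fit(X)
km2 = KMeans(n_clusters=2, random_state=2, n_init=20).fit(X)
print(np.array_equal(km1.labels_, km2.labels_))  # same seed -> same result
```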
Class labels are obtained with kmeans.labels_. Use these to overwrite the PollClass column in the 'g_loc_s' dataframe and look at the classification on the map using the same code as before.
Solution
g_loc_s["PollClass"] = kmeans.labels_
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
g_loc_s.plot(column="PollClass", ax=ax, legend=True, categorical=True, markersize=50)
plt.title("Vegetation Classification with KMeans")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
Here you don't produce a dendrogram but obtain an optimal cluster configuration for a given number of classes. Thus the question of how many classes to compute becomes more relevant. One way to evaluate this is using a scree plot.
sse = {}
for k in range(1, 15):
    kmeans = KMeans(n_clusters=k, random_state=2, n_init=20).fit(poll_st)
    sse[k] = kmeans.inertia_
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()
The position of the 'elbow', where the curve changes from a steep to a more gentle decline, indicates the number of clusters that should be used. So in our case 4 seems like a good choice.
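The elbow can also be located numerically, for example by finding where the drop in SSE between successive k suddenly shrinks. A rough heuristic sketch on an illustrative (not the real) SSE curve:

```python
import numpy as np

# Illustrative SSE curve with an elbow at k = 4
sse = {1: 100.0, 2: 60.0, 3: 35.0, 4: 20.0, 5: 17.0, 6: 15.0, 7: 13.5}

ks = np.array(list(sse.keys()))
vals = np.array(list(sse.values()))
drops = -np.diff(vals)  # SSE reduction gained by adding one more cluster
# Elbow: the k after which the drop collapses most strongly
elbow = ks[1:][np.argmax(drops[:-1] / drops[1:])]
print(elbow)
```

This is only a heuristic; visual inspection of the scree plot remains the usual check.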
Will the number of recommended clusters be different if we work with the full dataset? How does the K-Means classification of the full data compare to the one obtained with Chord distance and Ward linkage? Which one gives better spatial clusters?
Solution
poll_st = normalize(poll_surf, axis=1)
# KMeans
kmeans = KMeans(n_clusters=4, random_state=2, n_init=20).fit(poll_st)
geo_loc.insert(1, "PollClass", kmeans.labels_)
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
geo_loc.plot(column="PollClass", ax=ax, legend=True, categorical=True, markersize=50)
plt.title("Vegetation Classification with KMeans")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
sse = {}
for k in range(1, 15):
    kmeans = KMeans(n_clusters=k, random_state=2, n_init=20).fit(poll_st)
    sse[k] = kmeans.inertia_
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()
# Hierarchical clustering
d = distance.pdist(poll_st,'euclidean')
Z = linkage(d,'ward')
poll_class = cluster.hierarchy.cut_tree(Z, n_clusters=4)
geo_loc["PollClass"] = poll_class
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
geo_loc.plot(column="PollClass", ax=ax, legend=True, categorical=True, markersize=50)
plt.title("Vegetation Classification Hierarchical clustering")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
The scree plot looks very similar for the reduced and the full dataset. The K-Means classification of the full data and the hierarchical clustering yield similar results, with some marked differences. Both separate a group of northern sites relatively well and recognize a coastal group for Norway, but neither resolves the border between the hemiboreal and boreal forests well (e.g. Estonia and southern Finland are in the hemiboreal zone). Neither gives good spatial clusters.
Please explore the full data with all the insights you learned above. Consider applying cluster algorithms after data transformation or different ways of standardization. Decide on the classification you like best and motivate your choice.
Solution
I like the result of Ward linkage using the Euclidean distance after square-root transformation. The resulting dendrogram indicates 3 strong clusters, which are also visible on the map. There is a separation within the coastal cluster which results in some coastal sites being split off. A separation of the largest cluster is only achieved when the tree is cut into 5 clusters. Here the above-mentioned separation into hemiboreal and boreal forests is somewhat visible, and this may be reasonable. However, it is hard to evaluate, and thus for a large-scale view I would rather stay with 3 well separated clusters that I call: 0) boreal and hemiboreal forests; 1) northern and high-elevation sites; 2) coastal and western sites.
poll_st = poll_surf.apply(np.sqrt)
d = distance.pdist(poll_st,'euclidean')
Z = linkage(d,'ward')
plt.figure(figsize=(8, 6))
dendrogram(Z)
plt.xlabel('Sample ID')
plt.ylabel('Distance')
plt.show()
poll_class = cluster.hierarchy.cut_tree(Z, n_clusters=3)
geo_loc["PollClass"] = poll_class
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
geo_loc.plot(column="PollClass", ax=ax, legend=True, categorical=True, markersize=50)
plt.title("Vegetation Classification Hierarchical clustering")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
Mixing in nominal variables
As you have noticed, the data we are working with contains not just pollen percentages; there is also environmental information in the table holding the geographic coordinates. I have added one nominal variable to that table, a label for whether or not sites are near the coast. So let us collect some meaningful variables from that table and look at them.
env = sites.loc[:, ("Tjuly", "Tjanuary", "Precipitation", "Range", "tmp_tetra", "elev", "is_coastal")]
We have January and July temperatures as well as annual precipitation, interpolated from the nearest weather station, and the temperature difference 'Range'. The variables tmp_tetra (average warm-season temperature) and elev (elevation) were extracted by matching the sites to a climate grid; thus elevation is likely underestimated. Nevertheless, we would like to add another categorical variable to describe the highly different vegetation structure below and above the tree line. The tree line in Scandinavia declines from around 1000 m in the south to around 100 m in the north. We focus on the south and, considering the generalization of the mountain relief due to gridding, set the tree line at 700 m. We can create the new variable as follows:
t_tree = env['elev'] > 0.7  # 'elev' appears to be given in km, so 0.7 corresponds to the 700 m tree line
env['is_a_tree'] = np.multiply(t_tree, 1)  # convert boolean to 0/1 (same as t_tree.astype(int))
The variables we have now assembled in the table all have different units. Temperature and precipitation are already different quantities, and with the two categorical variables we are mixing completely different value ranges, so absolute differences between, e.g., temperature and precipitation have no meaning. Thus it is time to standardize the data. One way to do that is using the function in scikit-learn:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
env_s = scaler.fit_transform(env)
How does that function standardize the data by default? Using the same default, calculate the standardization for the column 'is_a_tree' in Python and provide the code. What is the standardized value of a '0' and a '1'? Are these numbers exactly the same as the output of StandardScaler()?
Solution
Answer: Standardization is done column-wise, subtracting the mean and dividing by the standard deviation.
mean = env['is_a_tree'].mean()
(env['is_a_tree'] - mean) / env['is_a_tree'].std()
The '0' value becomes -0.291372 and '1' is scaled to 3.423624. These values are slightly different from the output of StandardScaler(), as the std function in pandas uses N-1 by default; setting ddof=0 gives exactly the same values.
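The N versus N-1 point can be checked directly; pandas with ddof=0 reproduces StandardScaler exactly. A small sketch:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

x = pd.Series([0.0, 0.0, 0.0, 1.0, 1.0])

z_pandas = (x - x.mean()) / x.std()        # sample std, N-1 (pandas default)
z_ddof0  = (x - x.mean()) / x.std(ddof=0)  # population std, N
z_scaler = StandardScaler().fit_transform(x.to_frame()).ravel()

print(np.allclose(z_ddof0, z_scaler))  # ddof=0 matches StandardScaler
```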
Adapt the above code and make a cluster analysis of the data with environmental descriptors. 1) How many clusters would you use? Map these clusters as before and interpret the results: 2) Which environmental gradient gave rise to the separation of the two largest clusters? 3) How much did the addition of the two categorical variables influence the clustering? Look into the output or run additional analyses to find all answers.
Solution
Answers:
d = distance.pdist(env_s,'euclidean')
Z = linkage(d,'ward')
plt.figure(figsize=(8, 6))
dendrogram(Z)
plt.xlabel('Sample ID')
plt.ylabel('Distance')
plt.show()
env_class = cluster.hierarchy.cut_tree(Z, n_clusters=4)
geo_loc["EnvClass"] = env_class
fig, ax = plt.subplots(figsize=(10, 6))
world.plot(ax=ax, alpha=0.4, color="grey")
geo_loc.plot(column="EnvClass", ax=ax, legend=True, categorical=True, markersize=50)
plt.title("Environmental Classification")
ax.set_xlim(4, 32)
ax.set_ylim(55, 72)
- I would use 4 clusters as they are well separated.
- From just eyeballing the results it seems that summer warmth is separating the samples into a northern and a southern cluster.
- The addition of the categorical variables influenced the results substantially. Running the analysis without these two variables results in 3 distinct clusters, separating coastal sites and combining southern mountain sites with northern sites. A division into 4 classes would separate the mountain sites in a similar, but not identical, way to our assignment. However, the coastal sites did not form a cluster of exactly the sites flagged as 'is_coastal'. You can also compare the number of samples per flag value with the number of samples per class:
unique_values, counts = np.unique(env_class, return_counts=True)
for value, count in zip(unique_values, counts):
    print(f"{value} occurs {count} times")
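To see how well one cluster matches the 'is_coastal' flag, a cross-tabulation is more direct than comparing counts. A sketch on toy labels (with the real data you would pass env['is_coastal'] and the flattened cluster labels):

```python
import pandas as pd

# Toy stand-in: coastal flag and cluster labels for eight sites
is_coastal = pd.Series([1, 1, 0, 0, 0, 1, 0, 0], name="is_coastal")
labels = pd.Series([0, 0, 1, 1, 2, 0, 2, 1], name="cluster")

# Rows: flag value; columns: cluster; cells: number of sites
ct = pd.crosstab(is_coastal, labels)
print(ct)
```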