import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
# Other libraries for wk4 for plots and maps
import seaborn as sns
import geopandas
from shapely import wkt
# Other libraries for wk4 with statistical models
from statsmodels.stats.multicomp import (pairwise_tukeyhsd, MultiComparison)
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lmMapping, plots and Variance
Sibes waddendata
We will be working with the same SIBES dataset as last week and continue with more intricate plotting and statistical analyses including multiple groups and interactions between variables.
We need to load some more libraries compared to last week including some to work with spatial data, do check if these load properly! if not, you may be working outside the recommended enviroment for this course, check out the installation instructions to make sure you have followed all steps. (the python version we use is 3.12, you can check this easy in the top bar of spyder. If you are using 3.11 or older you are not in the right environment)
We continue with the loaded data from last Thursdays exercises, copy you own code for loading and merging locationsediement.csv and locationbiota.csv. (or copy from the hint below…)
Hint
#load the sediment data, set column names and types
rawdata = pd.read_csv("../Data/locationsediment.csv")
sediment = pd.DataFrame({
"geom": rawdata['geom'],
"date": pd.to_datetime(rawdata['date'], format='%Y-%m-%d'),
"station_id": rawdata['sampling_station_id'],
"sample_id": rawdata['sample_id'],
"grainsize_median": rawdata['sediment_median'],
"silt_proportion": rawdata['sediment_silt'],
} )
# add year as a catergorical column for easy selection and merging
sediment['year'] = sediment['date'].dt.year
#load the biota data, set column names and types
rawdata = pd.read_csv("../Data/locationbiota.csv")
biota = pd.DataFrame({
"geom": rawdata['geom'],
"date": pd.to_datetime(rawdata['date'], format='%Y-%m-%d'),
"sampling_type": rawdata['sampling_type'].astype("category"),
"sample_platform": rawdata['platform'].astype("category"),
"tidal_basin_name": rawdata['tidal_basin_name'].astype("category"),
"flat_name": rawdata['flat_name'].astype("category"),
"species": rawdata['name'].astype("category"),
"species_type": rawdata['type'].astype("category"),
"species_aphia_id": rawdata['aphia_id'].astype("category"),
"abundance_m2": rawdata['abundance_m2'],
"biomass_afdm_m2": rawdata['afdm_m2'],
} )
biota['year'] = biota['date'].dt.year
# Merge metadata
biota_metadata = biota[['year', 'geom', 'sampling_type', 'sample_platform', 'tidal_basin_name', 'flat_name']].drop_duplicates()
sibes_merged = pd.merge(left = sediment,
right = biota_metadata,
on = ['year', 'geom'],
how ='left')
# we cannot simply join the biomass values because they dont fit 1:1 but we can summerize them to the sample loctions/years and sum up all species.
biota_sums = biota.groupby(['geom', 'year'], observed=True).sum(numeric_only=True)
# Perform the merge on 'year' and 'geom'
sibes_merged = pd.merge(left = sibes_merged,
right = biota_sums,
on =['year', 'geom'],
how ='inner')for a more interesting analysis we want to retain more data, for example we can select specific species/groups or pivot to wide format with data columns for each entry. For this exercise we want to look not only at total biomass, but also biomass for individual species groups. We use the pivot_table function to create these for each of the groups.
# Pivot the data to get species types as separate columns
biomass_pertype = biota.pivot_table(
index=['geom', 'year'],
columns='species_type',
values=['biomass_afdm_m2'],
aggfunc='sum',
observed=True,
fill_value=0
)
# Check out the resulting dataframe it now has a multi level column header which makes things behave difficult for the further statistics and plotting (or for some functions is actually very handy...).
# Rename the columns and flatten the dataframe.
biomass_pertype.columns = [f"{col[0]}_{col[1]}" for col in biomass_pertype.columns]
biomass_pertype = biomass_pertype.reset_index()
# Again merge the new data on 'year' and 'geom'
sibes_merged = pd.merge(sibes_merged, biomass_pertype, on=['year', 'geom'], how='inner')The distributions of Biomass data are heavily skewed which does not make for nice plotting and reduces the usability of many statistical functions. We therefore log-transfer biomass to make the distribution closer to normal.
Q1 you cannot log transform 0 (zero), what would be an appropriate value to replace zeroes in the transformed dataset? NaN could be used, but then you loose a lot of data points
Solution
Q1: We can mask zeroes and replace them with -5, when we take a natural log (np.log) -5 is a good value as this falls just below the data range.
# We replace zeroes with -3 for a really low value rather than with 0 which falls in the middle of our log-transformed range (suitable replacement depends on your dataset)
sibes_merged['polychaete_logbiomass'] = np.log10(sibes_merged['biomass_afdm_m2_polychaete'].mask(sibes_merged['biomass_afdm_m2_polychaete'] <= 0.0)).fillna(-5)
sibes_merged['bivalve_logbiomass'] = np.log10(sibes_merged['biomass_afdm_m2_bivalve'].mask(sibes_merged['biomass_afdm_m2_bivalve'] <= 0.0)).fillna(-5)We leave this for now so further solutions remain the same, but during the practicals we came up with an alternative. Replace the original 0 values with a very low nr and then run the log transformation. also using a log10 is a bit more intuitive… then 10-3 (0.001) would be a better replacement value. This changes the outcomes of further testing, so just shown as an option her
# We replace zeroes with (0.001) so the transformation work well, this is much neater than the original of masking zeroes and replacing after...
sibes_merged['biomass_afdm_m2_polychaete'] = sibes_merged['biomass_afdm_m2_polychaete'].mask(sibes_merged['biomass_afdm_m2_polychaete'] <= 0.0, 0.001)
sibes_merged['polychaete_logbiomass'] = np.log10(sibes_merged['biomass_afdm_m2_polychaete'])
sibes_merged['biomass_afdm_m2_bivalve'] = sibes_merged['biomass_afdm_m2_bivalve'].mask(sibes_merged['biomass_afdm_m2_bivalve'] <= 0.0, 0.001)
sibes_merged['bivalve_logbiomass'] = np.log10(sibes_merged['biomass_afdm_m2_bivalve'])# We replace zeroes with -3 for a really low value rather than with 0 which falls in the middle of our log-transformed range (suitable replacement depends on your dataset)
sibes_merged['polychaete_logbiomass'] = np.log10(sibes_merged['biomass_afdm_m2_polychaete'].mask(sibes_merged['biomass_afdm_m2_polychaete'] <= 0.0)).fillna(-5)
sibes_merged['bivalve_logbiomass'] = np.log10(sibes_merged['biomass_afdm_m2_bivalve'].mask(sibes_merged['biomass_afdm_m2_bivalve'] <= 0.0)).fillna(-5)
#Reorder the dataframe with the desired column order we leave some data columns out, but you can easily uncomment to retain them
sibes_merged = sibes_merged[[
'geom',
'station_id',
'date',
'year',
'sample_id',
'sampling_type',
'sample_platform',
'tidal_basin_name',
'flat_name',
'grainsize_median',
'silt_proportion',
#'abundance_m2',
'biomass_afdm_m2',
#'biomass_afdm_m2_bivalve',
#'biomass_afdm_m2_crustaceans',
#'biomass_afdm_m2_polychaete',
#'biomass_afdm_m2_other',
'polychaete_logbiomass',
'bivalve_logbiomass'
]]
# To not overwhelm our plots with to many points to see, we again limit the area and timespan of the data
sibes = sibes_merged.loc[sibes_merged['tidal_basin_name'].isin(["Marsdiep", "Eierlandse Gat", "Vlie", "Borndiep"] ) , : ]
sibes['tidal_basin_name'] = sibes['tidal_basin_name'].cat.remove_unused_categories()
sibes = sibes.query('year == 2009') With all data neatly ordered, we can also make a quick selection (in this case of a single data point) and eg plot the changes in biomass trough the years for a single point
station = sibes_merged[sibes_merged['station_id'] == 2669]
plt.plot(station['year'], station['bivalve_logbiomass'], marker='o', linestyle='-', label='Bivalve')
plt.plot(station['year'], station['polychaete_logbiomass'], marker='o', linestyle='-', label='Polychaete')
plt.show()The figure created with the three lines above misses lots of thing, like legend, title, axes labels, and the lines/points are not in order of years… add to your code so the plot resembles the example below.

This figure is not all that pretty yet, and everything is rather small. Make adjustments to the figure to make it well suited for a powerpoint presentation or poster (meaning it needs to be read from quite a distance). Save the figure as .png from your code (not via the gui) and import in in ppt, is everything still the same size and well readable? if not adjust size and resolution to make it work.
Solution
Q2: We can mask zeroes and replace them with -5, when we take a natural log (np.log) -5 is a good value as this falls just below the data range.
# With all data neatly ordered, we can also make a quick selection and eg plot the changes in biomass trough the years for a single point
station = sibes_merged[sibes_merged['station_id'] == 2669].sort_values('year')
# Plotting
plt.figure(figsize=(12, 6))
plt.plot(station['year'], station['bivalve_logbiomass'], marker='o', linestyle='-', label='Bivalve')
plt.plot(station['year'], station['polychaete_logbiomass'], marker='o', linestyle='-', label='Polychaete')
plt.title('Biomass of taxonomic groups Over Years')
plt.xlabel('Year')
plt.ylabel('logBiomass (m²)')
plt.grid(True)
plt.xticks(station['year'])
plt.legend()
plt.show()Q3: for a presentation ma many things (text, linewidth, etc) large for better readability and remove grid. many other options are possible
plt.figure(figsize=(7.5, 5))
plt.plot(station['year'], station['bivalve_logbiomass'], marker='o', linestyle='-', linewidth=3, label='Bivalve')
plt.plot(station['year'], station['polychaete_logbiomass'], marker='o', linestyle='-',linewidth=3, label='Polychaete')
plt.title('Biomass of taxonomic groups Over Years', size='xx-large', weight=600)
plt.xlabel('Year', size='x-large', weight=600)
plt.ylabel('logBiomass (m²)', size='x-large', weight=600)
plt.xticks(station['year'], size='large')
plt.xticks(size='large')
plt.legend( prop = {"size": 'large'})
plt.grid(False)
plt.savefig('s08q04.png', dpi=300)
plt.close()ANOVA and Seaborn plots
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
Check out the example gallery of seaborn. Many of the example plots combine more that one plot type (scatterplot, line, histograms) to show how datasets are grouped, related, and distributed. some of these are good for data exploration, while others allow for more intricate storytelling.
sns.pairplot creates a scatterplot matrix which is a really good tool for exploring a dataset as it displays both fequency plots as scatteplots between a range of variables. You will need to limit the datase somewhat to not create giant unreadable matrices which take forever to generate…
#two examples. colored by sampling platform or by tidal basin
sns.pairplot(sibes[['sample_platform','tidal_basin_name','grainsize_median','silt_proportion','polychaete_logbiomass','bivalve_logbiomass']], hue='sample_platform')
sns.pairplot(sibes[['sample_platform','tidal_basin_name','grainsize_median','silt_proportion','polychaete_logbiomass','bivalve_logbiomass']], hue='tidal_basin_name')Check out he plot. did log-transforming make the the distibutions of biomass normal?
Check out he plot. Which of the variables have the largest differences between basins? and sample platform?
Solution
Q4: Not entirely normal, but close enough for many tests. esp bivalves has lots of 0 values, so looks quite bimodal in the fequency plot
Q5: There seems to be lots of overlap between the basins, Eierlandse gat has a tighter distribution for sediments, Borndiep seems to have a bit more bivalve biomass and less polychaete. Differences are difficult to recognize behind the effect of sample nrs per basin. For sample platform there ‘foot’ samples have lower silt proportions
Sums of squares
Using the two sample platforms as groups, calculate the total sum of squares for median grainsize, the within group sum of squares and the between group sum of squares (use python, show your steps)
Calculate the F statistic for the your samples. how many degrees of freedom do you use in the calculation?
Solution
Q6: You can create a dataframe with for each point the sample, the difference between group mean and sample (ingroup error) and the difference between its group mean and the overall mean (group_effect), and difference between each sample and the overall mean (sample_error). Then calculate the sums of squares over all values in that df. many other options are possible like splitting the data into groups first, calcultion group effects and adding them back together.
Q7: F=55.08. we use ngroups - 1 (=1) degrees of freedom for the treatment and n-ngroups (=2586-2) for the residual (DFerror)
s1 = pd.DataFrame({
'sample' : sibes['grainsize_median'],
'ingroup_error' : sibes['grainsize_median'].sub(sibes.groupby('sample_platform')['grainsize_median'].transform('mean')),
'group_effect' : sibes.groupby('sample_platform')['grainsize_median'].transform('mean') - sibes['grainsize_median'].mean(),
'sample_error' : sibes['grainsize_median'] - sibes['grainsize_median'].mean()
})
SStotal = sum(s1['sample_error']**2)
SSerror = sum((s1['ingroup_error'])**2)
SStreatment = sum((s1['group_effect'])**2)
DFtreatment = len(sibes.groupby('sample_platform')) -1
DFerror = s1['sample'].count() - len(sibes.groupby('sample_platform'))
F = ((SStreatment/DFtreatment)/(SSerror/DFerror))
print(F)We can also use the anova function to calculate similar statistics
#example anova
anova_lm( ols('grainsize_median ~ sampling_platform',sibes).fit())Run the ANOVA for sample platform on median grainsize with the anova_lm function. are your resuts the same as you hand calculation?
For the two sample groups we could have also run a t-test on the sample means (as we did last week) rather than this ANOVA. compare the outcome when using F-test vs t-test how large are the differences?
Solution
Q8: The values in the annova should be the same as your hand calculation. also see if the sum_sq and df aree the same. If the values are different, you can quickly check if SStotal - (SSerror + SStreatment) = 0 which should be the case.
Q9: we can compare the p-values for the two tests, they are extremely low (so significant difference in group) and the same (p=1.557e-13)
s1 = sibes.loc[sibes['sample_platform']=="boat",'grainsize_median']
s2 = sibes.loc[sibes['sample_platform']=="walk",'grainsize_median']
print(stats.ttest_ind(s1, s2))For multiple groups we can use a combination of anova and the tukey_hsd tests to figure out if there are differences between the basins and which differences are significant. run an anova for the variable you think is most difference
#example anova for median grainssize and the tidal basins
anova_lm(ols('grainsize_median ~ tidal_basin_name',sibes).fit())
# if the anova indicates a significant difference, tukey_hsd can indicate which of the individual groups are different
MultiComparison(sibes['grainsize_median'], sibes['tidal_basin_name']).tukeyhsd().summary()Is there is significant difference between the group according to the anova? Which basins have significant differences? Explain in your own words what this result means.
Solution
Q10: The anova (f-test) shows if there is any difference between the basins at all which is significant. The Tukey_hsd test makes comaparisons between all individual pairs of basins. We cannot significantly differntiate Marsdiep and Vlie, all other do significantly differ.
Covariance and correlation
So far we have been looking at groups in data for comparison, but not relations between variables. Go back to your scatterplot matrix, but now look which variables are related.
Which of the variables are related? which variables show no relation at all?
df.cov() and df.corr() calculation covariance and correlations between all variables in your dataframe (use numeric_only=True to avoid errors with date and categorical fields).
which variables have the highest covariance values? which ones the strongest correlation? Why are these different?
Disregard variables that are directly calculated from eachother (such as biomass and logbiomass), which variables in our data are strongly related?
Solution
Q11: Most variable are only weakly related, grainsize_median and silt_proportian are clearly correlated since they are different summary values for the same sediment size distributions of the sample (hence the somewhat odd shape)
Q12/13: abundance and biomass have very high covariance values because their variances itself (and means) are also large. In the correlation matrix the relations are normalized so the original values matter less. There bivalve biomass and total biomass are very strongly correlated (apparently bivalves make up large part of the total biomass.) quite interesting is the correlation between silt_proportion and bivalve_logbiomass which (0.24), not very high, but na notable relation between sediment and biology
Based on group differences and correlations, select two related variables and one grouping variable that have an interesting story. Select and make one of the figure types from the seaborn gallery and make a pretty figure. set all relevant aesthetics to fit the data well.
Spatial data and maps
As you have seen when exploring the SIBES data online (and from the coordinate data in the our table), the SIBES data is spatial. That means it would be very logical to display it on a map. Details about mapping and spatial analysis are beyond the scope of this course (see GIS and Remote Sensing), but python has many tools to handle spatial information.
Geopandas is a library for Python that makes working with spatial data quite easy. To convert our dataframe to geopandas we have to indicate what the geometry (in our case point coordinates) of the dataframe is
# make this a spatial dataframe and plot a map
sibes["Coordinates"] = geopandas.GeoSeries.from_wkt(sibes["geom"])
geosibes = geopandas.GeoDataFrame(sibes, geometry='Coordinates', crs='epsg:4326')Make some plots using geosibes.plot(), try out at median grainsize and logbiomass as variables, or some other ones if you fancy. what is the difference between the plotting the non-spatial data as we have done before
Solution
Q15: With a spatial dataframe we standard get a coordinate grid (map), we can choose a variable to color the points
geosibes.plot(column='grainsize_median')there are many more fancier options such as using the geosibes.explore() function that creates en interactive map (with basemap) in your browser.
#NB there is a little but in Windows where the browswer does not directly return control to python when closing the window (or pressing ctr+c) after you close the window python will get control back after a few minutes... (press clt+. to restart the python kernel, you have to rerun the entire script (libraries, data loading etc) after that.
geosibes.explore(column='bivalve_logbiomass').show_in_browser()Explore the bivalve biomass map. Where do the largest concentrations of clams occur in the studied part of the wadden sea? Is there a strong spatial relation between bivalves and sediment size?
Solution
Q16: bivalve distribution seems a bit split, in general higher values near the mainland, but also high values are found in banks more near the outer channels, this pattern makes that a clear relation to sediment size is difficult to see.