SIBES data & testing means

Sibes data

For session 6 (and 8) we will work with data collected in the Dutch Wadden Sea as part of the SIBES program at NIOZ. This is an intensive monitoring program for benthic macrofauna (mostly shellfish, worms, etc.) that live in the tidal mudflats. Check out https://basismonitoringwadden.waddenzee.nl/datahuis-0/overzicht-meet-monitorprogramma/overzicht-meet-monitorprogramma/bodemfauna/b27-synoptic-intertidal-benthic-surveys/ for a more comprehensive description of the monitoring project and, if you are interested, the journal papers referenced on the website (Bijleveld et al., 2012, about the sampling design, is especially relevant to our coursework).

Public datasets are available on the ‘Wadden viewer’, which also hosts many other data sources for the Wadden Sea. We are using two datasets specifically, SIBES Biota and SIBES Sediment, which were downloaded as .csv files and are available in the ../Data/ folder for the course.

Question 1

What sampling lay-out is used for the SIBES program?

Question 2

How large is the surface area (and volume) of each sample?

Question 3

To what depth is the sediment composition sampled and analyzed? From what level is this depth referenced?

Hint: check the data description website and paper for this information. (You should always read up on the background, methods, etc. of any data you receive, so you understand what you are working with.)
Solution

Q1: A 500 m grid (covering all intertidal areas of the Wadden Sea), plus an extra 10% random points.

Q2: 0.017 m2 (0.00425 m3)

Q3: 4 centimeters, referenced from the top of the mud/sand surface.

Loading and data preparation

We start our script by loading the required modules, as always. Then the two data files are read and cast into dataframes with the column names and data types we want.

import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
#load the sediment data, set column names and types
rawdata  = pd.read_csv("../Data/locationsediment.csv")
sediment = pd.DataFrame({
    "geom": rawdata['geom'], 
    "date": pd.to_datetime(rawdata['date'], format='%Y-%m-%d'),
    "station_id": rawdata['sampling_station_id'], 
    "sample_id": rawdata['sample_id'], 
    "grainsize_median": rawdata['sediment_median'],
    "silt_proportion": rawdata['sediment_silt'],
     } )
# add year as standard numerical column for easy selection and merging
sediment['year'] = sediment['date'].dt.year

#load the biota data, set column names and types
rawdata  = pd.read_csv("../Data/locationbiota.csv")
biota = pd.DataFrame({
    "geom": rawdata['geom'], 
    "date": pd.to_datetime(rawdata['date'], format='%Y-%m-%d'),
    "sampling_type": rawdata['sampling_type'].astype("category"),
    "sample_platform": rawdata['platform'].astype("category"),
    "tidal_basin_name": rawdata['tidal_basin_name'].astype("category"),
    "flat_name": rawdata['flat_name'].astype("category"),
    "species": rawdata['name'].astype("category"),
    "species_type": rawdata['type'].astype("category"),
    "species_aphia_id": rawdata['aphia_id'].astype("category"),
    "abundance_m2": rawdata['abundance_m2'],
    "biomass_afdm_m2": rawdata['afdm_m2'],
    } )
biota['year'] = biota['date'].dt.year

Both datasets were collected during the same sampling campaigns, but are made available in separate files for the sediment information and the biological (species) data. Check out the loaded dataframes. What are their sizes (columns, rows)? Which data do they have in common? Are there empty or missing data (in which columns)? Do any columns contain lots of duplicated data?
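These checks take only a few lines. A minimal sketch on a small stand-in frame (with the real data you would call the same methods on sediment and biota directly):

```python
import pandas as pd

# stand-in frame with the same kinds of columns as the loaded data
df = pd.DataFrame({
    "geom": ["POINT(1 2)", "POINT(1 2)", "POINT(3 4)"],
    "grainsize_median": [150.0, None, 210.0],
})

print(df.shape)                        # (rows, columns)
print(df.isna().sum())                 # missing values per column
print(df["geom"].duplicated().sum())   # duplicated values within a column
```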

Question 4

Especially in the biota dataframe we have set many columns to the ‘category’ data type. Check out the pandas reference and cheat sheet. Explain why this data type is suitable for the columns it is used on.

Question 5

Consider ‘tidy’ data structures. Are the biota and sediment tables in a long or a wide format?

Question 6

Since both datasets originate from the same samples, could we merge the two directly into one dataframe? Explain the main hurdles to combining these data files for analysis.

Question 7

If we were to merge (or join) the tables, which columns would make suitable indices to combine the tables?

Solution

Q4: These are text columns that describe categories with relatively few unique values (they are nominal data). Declaring them as category makes pandas recognize them as such, and they can then be used to group the data, make legend entries, etc.

Q5: Sediment is tidy and has a proper index (station_id and sample_id); we like to see this. Biota is in long format (which is also tidy), which is logical given the many possible species per sample, but the number of species differs between locations.
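The long/wide distinction is easiest to see in a tiny made-up example: pivoting a long biota-style table gives a wide table with one column per species (the names and values here are purely illustrative):

```python
import pandas as pd

# minimal long-format table in the style of the biota data (made-up values)
long_df = pd.DataFrame({
    "geom": ["A", "A", "B"],
    "species": ["cockle", "worm", "cockle"],
    "biomass_afdm_m2": [1.2, 0.4, 2.0],
})

# wide format: one row per location, one column per species;
# missing species/location combinations become NaN
wide_df = long_df.pivot(index="geom", columns="species", values="biomass_afdm_m2")
print(wide_df)
```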

Q6: Biota is much longer because it has many rows per sample (one for each species), so the tables cannot be matched 1:1. In addition, a shared index column is missing in biota (sample_id from sediment would be ideal), but we can join on geom and date, which mostly match the sediment table exactly (we use year because date fields are special and cause trouble when joining).

Q7: To match the same samples we use the location (‘geom’) as the matching field. Because we have multiple years of sampling, ‘date’ or ‘year’ also needs to be used as an index.

Since many of the categorical data columns vary only from one sample to another (and not for each species), we can quite easily add this information to the sediment data by matching only the unique samples from biota to the sediment table.

# Merge the unique sample metadata of the biota table with the sediment table.
# nrow (sediment) == nrow(sibes_merged)
biota_metadata = biota[['year', 'geom', 'sampling_type',  'sample_platform', 'tidal_basin_name', 'flat_name']].drop_duplicates()
sibes_merged = pd.merge(left = sediment, 
                        right = biota_metadata, 
                        on = ['year', 'geom'], 
                        how ='left')

Check out the resulting dataframe. In very many years the biota data has no entries (since 2015 only small parts of the area were fully processed), and we have little interest in the samples that were only partly processed. We can make a selection on the years again, as we did with the KNMI data, but we can also change the join type in the merge step so that only observations with matching data are retained.

Question 8

What join type (instead of how='left') would we set to retain only observations with matching data in each table?

Solution

Q8: ‘inner’ join. (see https://pandas.pydata.org/docs/reference/api/pandas.merge.html )
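A minimal illustration of the difference between the two join types (made-up keys and values):

```python
import pandas as pd

left = pd.DataFrame({"year": [2009, 2010, 2011], "x": [1, 2, 3]})
right = pd.DataFrame({"year": [2009, 2011], "y": [10, 30]})

# 'left' keeps all rows of the left table, filling NaN where there is no match
m_left = pd.merge(left, right, on="year", how="left")
# 'inner' keeps only rows with a match in BOTH tables
m_inner = pd.merge(left, right, on="year", how="inner")

print(len(m_left), len(m_inner))  # 3 2
```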

As seen, we cannot simply join the biomass values because they don’t fit 1:1 with the sediment table, but we can summarize them to the sample locations/years and use those fields. As a quick first step we can simply sum over all species to get the total biomass and abundance for each sample, which does merge easily with the other data.

# total sums of biomass and abundance for all species
biota_sums = biota.groupby(['geom', 'year'], observed=True).sum(numeric_only=True)

# merge with the other dataframe (with sediment and biota metadata)
sibes_merged = pd.merge(left = sibes_merged, 
                        right =  biota_sums, 
                        on =['year', 'geom'], 
                        how ='inner')

# Now we reorder the dataframe to my desired column order just because it is prettier...
sibes_merged = sibes_merged[[
    'geom',
    'station_id',
    'date',
    'year',
    'sample_id',
    'sampling_type',  
    'sample_platform', 
    'tidal_basin_name', 
    'flat_name',
    'grainsize_median',
    'silt_proportion',
    'abundance_m2',
    'biomass_afdm_m2'
]]

Explore the dataset

We now have a prepared data table we can further explore. Have a look at the tables, make some summaries, do quick histograms, etc. Categorical variables can be quickly used to group the data (.groupby) to explore different groupings of the dataset.

# make a quick plot of all variables
sibes_merged.hist()

# summary of median grainsize for each tidal basin
sibes_merged[['grainsize_median']].groupby(sibes_merged['tidal_basin_name']).describe()
Question 9

Of the sediment (grainsize and silt proportion) and biota (biomass and abundance) data, which variables look approximately normally distributed? What distributions do the other variables have?

Question 10

Use .groupby and other relevant functions to let Python show you how many samples are available in each tidal basin. What fraction of the samples was taken by boat (or on foot) in each basin?

Solution

Q9: Grainsize is the only one with a (near-)normal distribution; the others (silt, biomass, abundance) are heavily skewed and look more like logarithmic or exponential distributions.
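One way to look at such right-skewed variables more closely is a log transform. A sketch on simulated data (with the real data you could transform silt_proportion or biomass_afdm_m2 the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
# simulated heavily right-skewed 'biomass-like' values
biomass = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# log1p handles zeros gracefully and pulls in the long right tail,
# making a histogram of the transformed values much more symmetric
log_biomass = np.log1p(biomass)

# in right-skewed data the mean sits well above the median
print(np.mean(biomass), np.median(biomass))
```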

Q10

# summary of median grainsize for each tidal basin
sibes_merged[['grainsize_median']].groupby(sibes_merged['tidal_basin_name']).describe()
# column 1 (count) > nr of samples per basin

sibes_merged[['sample_platform']].groupby(sibes_merged['tidal_basin_name']).value_counts(normalize=True)
# gives the proportions directly. Without normalize=True (or with normalize=False) we get the actual sample numbers per platform.

Moving to statistical testing, we will limit the dataset to a few basins and only a single year. This makes the outcomes more quickly interpretable, and having repeated samples may quickly violate the assumption of independence in the data. We will use the west side of the Wadden Sea (near Texel and Vlieland) for 2009 for now.

# now we can quite easily select just a few basins on the west side of the Wadden Sea to run some statistics on, with outputs that are still quickly interpretable
sibes = sibes_merged.loc[sibes_merged['tidal_basin_name'].isin(["Marsdiep", "Eierlandse Gat", "Vlie", "Borndiep"]), :].copy()
sibes = sibes.loc[sibes['year'] == 2009, ]
# clean up empty categories as they will otherwise keep showing up in all analyses
sibes['tidal_basin_name'] = sibes['tidal_basin_name'].cat.remove_unused_categories()
Question 11

How many samples do we have in our subset? Do you expect this to be sufficient to detect significant differences between groups if we further analyse the data?

Solution

Q11: 2586 samples (rows). This is plenty to detect significant differences, so data size is not a limiting factor here.

Testing of group means

Now that the data are ready, let’s compare some group means by running a t-test…

# Define two groups (sampled by boat and on foot)
s1 = sibes.loc[sibes['sample_platform']=="boat",'grainsize_median'] 
s2 = sibes.loc[sibes['sample_platform']=="walk",'grainsize_median'] 
# Quick visual comparison by histogram
plt.hist(s1, density=True, alpha=0.5)
plt.hist(s2, density=True, alpha=0.5)
plt.show()
# The t-test
print(stats.ttest_ind(s1, s2))
Question 12

Check out the result of the t-test. Was the test significant? What does that even mean? Do you understand what and how we tested?

Solution

Q12: p << 0.001, so very significant… but probably no clue yet what we did or what it means :-)

We have seen that it is trivially easy to run a t-test (or, for that matter, nearly any statistical test) on your dataset using Python, and it will spit out numbers. It is still up to you to choose (and justify) the right method and to interpret the outcomes (and explain them to your audience).

Let’s go back to the test above for median grainsize between ‘boat’ and ‘walk’ samples.

Question 13

What are the assumptions the data need to meet to allow for a t-test? Are they met for median grainsize between the two sampling platforms?

Question 14

Write out the formal testing hypotheses (H0 and H1) and explain in common language what the interpretation of rejecting H0 is.

Question 15

For these exercises we will use alpha = 0.05 as the significance level. Explain what this value means. What are other commonly used alpha values?

Question 16

What are the sample means and standard deviations for these groups? How large is the difference between the means?

Solution

Q13: The data are continuous; the sample is representative of the population; there is homogeneity of variance (i.e., the variability of the data in each group is similar); and the distribution is somewhat near normal. For median grainsize these assumptions hold.
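The homogeneity-of-variance assumption can also be checked numerically, for example with Levene's test. A sketch on simulated data (with the real data you would pass the s1 and s2 groups directly):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)
# two simulated 'grainsize-like' groups with equal spread
g1 = rng.normal(loc=180, scale=40, size=500)
g2 = rng.normal(loc=200, scale=40, size=500)

# Levene's test: H0 is that the variances are equal; a large p-value
# gives no evidence against the homogeneity-of-variance assumption
stat, p = stats.levene(g1, g2)
print(stat, p)
```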

Q14: H0: µ1 = µ2, Ha: µ1 ≠ µ2. Rejection means it is unlikely that the two samples were taken from the same population (i.e., the populations behind the two samples are different).

Q15: alpha = 0.05 means we accept a 5% chance of rejecting the null hypothesis when it is in fact true (a type I error). When we need to be more certain because errors may have serious consequences (e.g., in much medical research, or in critical engineering decisions), a lower alpha is often used, such as 0.01 or even 0.001.

Q16:

sibes_merged[['grainsize_median']].groupby(sibes_merged['sample_platform']).describe()
s2.mean() - s1.mean() #20.87

Run the same test for the grid vs random samples in the dataset. What are the group means and standard deviations?

Question 17

Is the difference between these groups significant?

Question 18

These are pretty close… Let’s try to test the difference between the grid and random groups again, but now for silt proportion (which is not normally distributed). What test would you use here?

Run the test and explain the outcomes.

Solution

Q17: p: 0.07 > 0.05, so the difference is not significant.

s1 = sibes.loc[sibes['sampling_type']=="random",'grainsize_median'] 
s2 = sibes.loc[sibes['sampling_type']=="grid",'grainsize_median'] 
print(stats.ttest_ind(s1, s2))

Q18: Silt proportion is heavily non-normal, so we use a non-parametric test, the Mann-Whitney U test in this case. p: 0.03 < 0.05, so the group difference is significant based on this outcome.

s1 = sibes.loc[sibes['sampling_type']=="random",'silt_proportion'] 
s2 = sibes.loc[sibes['sampling_type']=="grid",'silt_proportion'] 
print(stats.mannwhitneyu(s1, s2))

When we have, for example, repeated measurements of the same samples (or, in our case, repeated sampling at the same locations), those samples are no longer independent. We can look at differences at matched locations by running a paired test. We will run this test to compare grainsize_median between 2009 and 2012.

# We need to go back a bit in the code to redefine our dataset with all years.
sibes = sibes_merged.loc[sibes_merged['tidal_basin_name'].isin(["Marsdiep", "Eierlandse Gat", "Vlie", "Borndiep"]), :].copy()
sibes['tidal_basin_name'] = sibes['tidal_basin_name'].cat.remove_unused_categories()
sibes.info()

# with the .pivot_table function we can make a table with only median grainsize, but one column for each year
grainsize_byyear = sibes.pivot_table(
    index='station_id', 
    columns='year', 
    values='grainsize_median', 
    observed=True
)

# create your groups s1 / s2 and run your test. 
s1 = grainsize_byyear[2009]
s2 = grainsize_byyear[2012]
print(stats.ttest_ind(s1, s2, nan_policy='omit') )
print(stats.ttest_rel(s1, s2, nan_policy='omit') )
Question 19

How many valid observations are available in 2009 and 2012? How many have an observation in each year?

How do these relate to the testing degrees of freedom?

Question 20

Both approaches could be used in this case, but the outcomes are very different. Explain the differences between the tests and which outcome you think is more reliable and/or informative.

Question 21

An alternative way to test a paired sample is to compute the difference between each pair and test whether the mean of those differences deviates significantly from 0. Calculate and run the test in that fashion and compare the outcomes.

Solution

Q19:

# nr of observations for each group
s1.dropna().count() #2569
s2.dropna().count() #2540

# inner merge omits any observations missing from either dataframe
pd.merge(s1.dropna(), s2.dropna(), left_index=True, right_index=True, how='inner').count() #2291

df independent = n(s1) + n(s2) - 2 = 2569 + 2540 - 2 = 5107; df paired = n(s1 & s2) - 1 = 2291 - 1 = 2290

Q20: Grainsize has large spatial differences (high variability in the data) and small changes between years, so independent testing makes little sense. By taking repeat samples at the same locations we can remove the spatial variability and only check how much the change is at matched locations. This strongly reduces the test variance and allows us to detect the changes without them being obscured by the high spatial variance.

Q21:

d12 = s2 - s1
stats.ttest_1samp(d12, 0, nan_policy='omit')

We have now experimented with quite a large number of hypothesis tests for this dataset. Suppose you have to create a report about this analysis but have to keep it fairly brief: which of the test results would you include? Explain your choice.

For data exploration, freely playing with the data and making many plots and analyses is quite OK, but in hypothesis testing (or the statistical modelling we do in later sessions) running many statistical analyses and choosing only a few to report on is a big faux pas in statistics (especially when selecting only the most significant results from many). This is sometimes labeled ‘p-hacking’.

Question 22

Explain the main reasons why selectively reporting statistical results after running them is scientifically not sound.

Solution

Q22: If we make selections after knowing the outcomes, the chosen alpha no longer represents the error rate of our reported results. https://xkcd.com/882/

A 5% chance of rejecting a valid H0 is often considered acceptable, but if we run multiple tests, our chance of hitting (and reporting) a type I error increases a lot.
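This inflation is easy to quantify: under alpha = 0.05 the chance of at least one type I error across 20 independent tests is 1 - 0.95^20 ≈ 0.64. A small simulation sketch (samples drawn from the same population, so H0 is true by construction):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)
alpha, n_tests = 0.05, 20

# many t-tests on pairs of samples from the SAME population:
# every 'significant' result here is a type I error by construction
p_values = [
    stats.ttest_ind(rng.normal(size=50), rng.normal(size=50)).pvalue
    for _ in range(n_tests)
]

# theoretical chance of at least one false positive across the tests
family_wise = 1 - (1 - alpha) ** n_tests
print(f"family-wise error rate: {family_wise:.2f}")  # 0.64
```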

Question 23

The loaded SIBES dataset has many more data fields, both numerical (biomass, abundance) and categorical (tidal basin, tidal flat), plus other years of data if we make a different subset. Think of one scientific question that you find (at least somewhat) interesting to answer with this data, then run and explain the test.

What is your question? Which values/groups do you use? Give the formal statistical hypotheses. What do the outcomes (both accepting and rejecting H0) mean in your own words? What are the test statistics? Explain the result.

Solution

Q23: Everyone has their own results, but the code should run a working test, and you should have provided an interpretation of the results both formally (e.g., H0 rejected, p = 0.0##) and as a real-life interpretation.