SIBES data & testing means

Sibes data

For session 6 (and 8) we will work with data collected in the Dutch Wadden Sea as part of the SIBES program at NIOZ. This is an intensive monitoring program for benthic macrofauna (mostly shellfish, worms, etc.) that live in the tidal mudflats. Check out https://basismonitoringwadden.waddenzee.nl/datahuis-0/overzicht-meet-monitorprogramma/overzicht-meet-monitorprogramma/bodemfauna/b27-synoptic-intertidal-benthic-surveys/ for a more comprehensive description of the monitoring project and, if you are interested, the journal papers referenced on the website (Bijleveld et al., 2012, about the sampling design, is especially relevant to our coursework).

Public datasets are available on the ‘Wadden viewer’, which also hosts many other data sources for the Wadden Sea. We are using two datasets specifically, SIBES Biota and SIBES Sediment, which were downloaded as .csv files and are available in the ../Data/ folder for the course.

Question 1

What sampling lay-out is used for the SIBES program?

Question 2

How large is the surface area (and volume) of each sample?

Question 3

To what depth is the sediment composition sampled and analyzed? From what level is this depth referenced?

Hint: check the data description website and paper for this information. (You should always read up on the background, methods, etc. of any data you receive, so you understand what you are working with.)
Solution

Q1: A 500 m grid (covering all intertidal areas of the Wadden Sea), plus an extra 10% random points.

Q2: 0.017 m2 (0.00425 m3)

Q3: 4 centimeters, referenced from the top of the mud/sand surface.

Loading and data preparation

We start our script by loading the required modules, as always. Then the two data files are read and cast into dataframes with the column names and data types we want.

import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
#load the sediment data, set column names and types
rawdata  = pd.read_csv("../Data/locationsediment.csv")
sediment = pd.DataFrame({
    "geom": rawdata['geom'], 
    "date": pd.to_datetime(rawdata['date'], format='%Y-%m-%d'),
    "station_id": rawdata['sampling_station_id'], 
    "sample_id": rawdata['sample_id'], 
    "grainsize_median": rawdata['sediment_median'],
    "silt_proportion": rawdata['sediment_silt'],
     } )
# add year as standard numerical column for easy selection and merging
sediment['year'] = sediment['date'].dt.year

#load the biota data, set column names and types
rawdata  = pd.read_csv("../Data/locationbiota.csv")
biota = pd.DataFrame({
    "geom": rawdata['geom'], 
    "date": pd.to_datetime(rawdata['date'], format='%Y-%m-%d'),
    "sampling_type": rawdata['sampling_type'].astype("category"),
    "sample_platform": rawdata['platform'].astype("category"),
    "tidal_basin_name": rawdata['tidal_basin_name'].astype("category"),
    "flat_name": rawdata['flat_name'].astype("category"),
    "species": rawdata['name'].astype("category"),
    "species_type": rawdata['type'].astype("category"),
    "species_aphia_id": rawdata['aphia_id'].astype("category"),
    "abundance_m2": rawdata['abundance_m2'],
    "biomass_afdm_m2": rawdata['afdm_m2'],
    } )
biota['year'] = biota['date'].dt.year

Both datasets were collected during the same sampling campaigns, but are made available in separate files for the sediment information and the biological (species) data. Check out the loaded dataframes. What are their sizes (columns, rows)? Which data do they have in common? Are there empty or missing data (in which columns)? Do any columns contain lots of duplicated data?
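These checks take only a few lines. A minimal sketch on a small stand-in frame (with the real data you would call the same methods on sediment and biota directly):

```python
import pandas as pd

# stand-in frame with the same kinds of columns as the loaded data
df = pd.DataFrame({
    "geom": ["POINT(1 2)", "POINT(1 2)", "POINT(3 4)"],
    "grainsize_median": [150.0, None, 210.0],
})

print(df.shape)                        # (rows, columns)
print(df.isna().sum())                 # missing values per column
print(df["geom"].duplicated().sum())   # duplicated values within a column
```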

Question 4

Especially in the biota dataframe we have set many columns to the ‘category’ data type. Check out the pandas reference and cheat sheet. Explain why this data type is suitable for the columns it is used on.

Question 5

Consider ‘tidy’ data structures. Are the biota and sediment tables in a long or a wide format?

Question 6

Since both datasets originate from the same samples, could we merge the two directly into one dataframe? Explain the main hurdles to combining these data files for analysis.

Question 7

If we were to merge (or join) the tables, which columns would make suitable indices to combine the tables?

Solution

Q4: These are text columns that describe categories with relatively few unique values (they are nominal data). Declaring them as category makes pandas recognize them as such, and they can then be used to group the data, make legend entries, etc.

Q5: Sediment is tidy and has a proper index (station_id and sample_id); we like to see this. Biota is in long format (which is also tidy), which is logical given the many possible species per sample, but the number of species differs between locations.
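The long/wide distinction is easiest to see in a tiny made-up example: pivoting a long biota-style table gives a wide table with one column per species (the names and values here are purely illustrative):

```python
import pandas as pd

# minimal long-format table in the style of the biota data (made-up values)
long_df = pd.DataFrame({
    "geom": ["A", "A", "B"],
    "species": ["cockle", "worm", "cockle"],
    "biomass_afdm_m2": [1.2, 0.4, 2.0],
})

# wide format: one row per location, one column per species;
# missing species/location combinations become NaN
wide_df = long_df.pivot(index="geom", columns="species", values="biomass_afdm_m2")
print(wide_df)
```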

Q6: Biota is much longer because it has many rows per sample (one for each species), so the tables cannot be matched 1:1. In addition, a shared index column is missing in biota (sample_id from sediment would be ideal), but we can join on geom and date, which mostly match the sediment table exactly (we use year because date fields are special and cause trouble when joining).

Q7: To match the same samples we use the location (‘geom’) as the matching field. Because we have multiple years of sampling, ‘date’ or ‘year’ also needs to be used as an index.

Since many of the categorical data columns vary only from one sample to another (and not for each species), we can quite easily add this information to the sediment data by matching only the unique samples from biota to the sediment table.

# Merge the unique sample metadata of the biota table with the sediment table.
# nrow (sediment) == nrow(sibes_merged)
biota_metadata = biota[['year', 'geom', 'sampling_type',  'sample_platform', 'tidal_basin_name', 'flat_name']].drop_duplicates()
sibes_merged = pd.merge(left = sediment, 
                        right = biota_metadata, 
                        on = ['year', 'geom'], 
                        how ='left')

Check out the resulting dataframe. In very many years the biota data has no entries (since 2015 only small parts of the area were fully processed), and we have little interest in the samples that were only partly processed. We can make a selection on the years again, as we did with the KNMI data, but we can also change the join type in the merge step so that only observations with matching data are retained.

Question 8

What join type (instead of how='left') would we set to retain only observations with matching data in each table?

Solution

Q8: ‘inner’ join. (see https://pandas.pydata.org/docs/reference/api/pandas.merge.html )
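A minimal illustration of the difference between the two join types (made-up keys and values):

```python
import pandas as pd

left = pd.DataFrame({"year": [2009, 2010, 2011], "x": [1, 2, 3]})
right = pd.DataFrame({"year": [2009, 2011], "y": [10, 30]})

# 'left' keeps all rows of the left table, filling NaN where there is no match
m_left = pd.merge(left, right, on="year", how="left")
# 'inner' keeps only rows with a match in BOTH tables
m_inner = pd.merge(left, right, on="year", how="inner")

print(len(m_left), len(m_inner))  # 3 2
```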

As seen, we cannot simply join the biomass values because they don’t fit 1:1 with the sediment table, but we can summarize them to the sample locations/years and use those fields. As a quick first step we can simply sum over all species to get the total biomass and abundance for each sample, which does merge easily with the other data.

# total sums of biomass and abundance for all species
biota_sums = biota.groupby(['geom', 'year'], observed=True).sum(numeric_only=True)

# merge with the other dataframe (with sediment and biota metadata)
sibes_merged = pd.merge(left = sibes_merged, 
                        right =  biota_sums, 
                        on =['year', 'geom'], 
                        how ='inner')

# Now we reorder the dataframe to my desired column order just because it is prettier...
sibes_merged = sibes_merged[[
    'geom',
    'station_id',
    'date',
    'year',
    'sample_id',
    'sampling_type',  
    'sample_platform', 
    'tidal_basin_name', 
    'flat_name',
    'grainsize_median',
    'silt_proportion',
    'abundance_m2',
    'biomass_afdm_m2'
]]

Explore the dataset

We now have a prepared data table we can further explore. Have a look at the tables, make some summaries, do quick histograms, etc. Categorical variables can be quickly used to group the data (.groupby) to explore different groupings of the dataset.

# make a quick plot of all variables
sibes_merged.hist()

# summary of median grainsize for each tidal basin
sibes_merged[['grainsize_median']].groupby(sibes_merged['tidal_basin_name']).describe()
Question 9

Of the sediment (grainsize and silt proportion) and biota (biomass and abundance) data, which variables look approximately normally distributed? What distributions do the other variables have?

Question 10

Use .groupby and other relevant functions to let Python show you how many samples are available in each tidal basin. What fraction of the samples was taken by boat (or on foot) in each basin?

Solution

Q9: Grainsize is the only one with a (near-)normal distribution; the others (silt, biomass, abundance) are heavily skewed and look more like logarithmic or exponential distributions.
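One way to look at such right-skewed variables more closely is a log transform. A sketch on simulated data (with the real data you could transform silt_proportion or biomass_afdm_m2 the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
# simulated heavily right-skewed 'biomass-like' values
biomass = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# log1p handles zeros gracefully and pulls in the long right tail,
# making a histogram of the transformed values much more symmetric
log_biomass = np.log1p(biomass)

# in right-skewed data the mean sits well above the median
print(np.mean(biomass), np.median(biomass))
```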

Q10

# summary of median grainsize for each tidal basin
sibes_merged[['grainsize_median']].groupby(sibes_merged['tidal_basin_name']).describe()
# column 1 (count) > nr of samples per basin

sibes_merged[['sample_platform']].groupby(sibes_merged['tidal_basin_name']).value_counts(normalize=True)
# gives the proportions directly. Without normalize=True (or with normalize=False) we get the actual sample numbers per platform.

Moving to statistical testing, we will limit the dataset to a few basins and only a single year. This makes the outcomes more quickly interpretable, and having repeated samples may quickly violate the assumption of independence in the data. We will use the west side of the Wadden Sea (near Texel and Vlieland) for 2009 for now.

# now we can quite easily select just a few basins on the west side of the Wadden Sea to run some statistics on, with outputs that are still quickly interpretable
sibes = sibes_merged.loc[sibes_merged['tidal_basin_name'].isin(["Marsdiep", "Eierlandse Gat", "Vlie", "Borndiep"]), :].copy()
sibes = sibes.loc[sibes['year'] == 2009, ]
# clean up empty categories as they will otherwise keep showing up in all analyses
sibes['tidal_basin_name'] = sibes['tidal_basin_name'].cat.remove_unused_categories()
Question 11

How many samples do we have in our subset? Do you expect this to be sufficient to detect significant differences between groups if we further analyse the data?

Solution

Q11: 2586 samples (rows). This is plenty to detect significant differences, so data size is not a limiting factor here.

Testing of group means

Now that the data are ready, let’s compare some group means by running a t-test…

# Define two groups (sampled by boat and on foot)
s1 = sibes.loc[sibes['sample_platform']=="boat",'grainsize_median'] 
s2 = sibes.loc[sibes['sample_platform']=="walk",'grainsize_median'] 
# Quick visual comparison by histogram
plt.hist(s1, density=True, alpha=0.5)
plt.hist(s2, density=True, alpha=0.5)
plt.show()
# The t-test
print(stats.ttest_ind(s1, s2))
Question 12

Check out the result of the t-test. Was the test significant? What does that even mean? Do you understand what and how we tested?

Solution

Q12: p << 0.001, so very significant… but probably no clue yet what we did or what it means :-)

We have seen that it is trivially easy to run a t-test (or, for that matter, nearly any statistical test) on your dataset using Python, and it will spit out numbers. It is still up to you to choose (and justify) the right method and to interpret the outcomes (and explain them to your audience).

Let’s go back to the test above for median grainsize between ‘boat’ and ‘walk’ samples.

Question 13

What are the assumptions the data need to meet to allow for a t-test? Are they met for median grainsize between the two sampling platforms?

Question 14

Write out the formal testing hypotheses (H0 and H1) and explain in common language what the interpretation of rejecting H0 is.

Question 15

For these exercises we will use alpha = 0.05 as the significance level. Explain what this value means. What are other commonly used alpha values?

Question 16

What are the sample means and standard deviations for these groups? How large is the difference between the means?

Solution

Q13: The data are continuous; the sample is representative of the population; there is homogeneity of variance (i.e., the variability of the data in each group is similar); and the distribution is somewhat near normal. For median grainsize these assumptions hold.
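The homogeneity-of-variance assumption can also be checked numerically, for example with Levene's test. A sketch on simulated data (with the real data you would pass the s1 and s2 groups directly):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)
# two simulated 'grainsize-like' groups with equal spread
g1 = rng.normal(loc=180, scale=40, size=500)
g2 = rng.normal(loc=200, scale=40, size=500)

# Levene's test: H0 is that the variances are equal; a large p-value
# gives no evidence against the homogeneity-of-variance assumption
stat, p = stats.levene(g1, g2)
print(stat, p)
```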

Q14: H0: µ1 = µ2, Ha: µ1 ≠ µ2. Rejection means it is unlikely that the two samples were taken from the same population (i.e., the populations behind the two samples are different).

Q15: alpha = 0.05 means we accept a 5% chance of rejecting the null hypothesis when it is in fact true (a type I error). When we need to be more certain because errors may have serious consequences (e.g., in much medical research, or in critical engineering decisions), a lower alpha is often used, such as 0.01 or even 0.001.

Q16:

sibes_merged[['grainsize_median']].groupby(sibes_merged['sample_platform']).describe()
s2.mean() - s1.mean() #20.87

Run the same test for the grid vs random samples in the dataset. What are the group means and standard deviations?

Question 17

Is the difference between these groups significant?

Question 18

These are pretty close… Let’s try to test the difference between the grid and random groups again, but now for silt proportion (which is not normally distributed). What test would you use here?

Run the test and explain the outcomes.

Solution

Q17: p: 0.07 > 0.05, so the difference is not significant.

s1 = sibes.loc[sibes['sampling_type']=="random",'grainsize_median'] 
s2 = sibes.loc[sibes['sampling_type']=="grid",'grainsize_median'] 
print(stats.ttest_ind(s1, s2))

Q18: Silt proportion is heavily non-normal, so we use a non-parametric test, the Mann-Whitney U test in this case. p: 0.03 < 0.05, so the group difference is significant based on this outcome.

s1 = sibes.loc[sibes['sampling_type']=="random",'silt_proportion'] 
s2 = sibes.loc[sibes['sampling_type']=="grid",'silt_proportion'] 
print(stats.mannwhitneyu(s1, s2))

When we have, for example, repeated measurements of the same samples (or, in our case, repeated sampling at the same locations), those samples are no longer independent. We can look at differences at matched locations by running a paired test. We will run this test to compare grainsize_median between 2009 and 2012.

# We need to go back a bit in the code to redefine our dataset with all years.
sibes = sibes_merged.loc[sibes_merged['tidal_basin_name'].isin(["Marsdiep", "Eierlandse Gat", "Vlie", "Borndiep"]), :].copy()
sibes['tidal_basin_name'] = sibes['tidal_basin_name'].cat.remove_unused_categories()
sibes.info()

# with the .pivot_table function we can make a table with only median grainsize, but one column for each year
grainsize_byyear = sibes.pivot_table(
    index='station_id', 
    columns='year', 
    values='grainsize_median', 
    observed=True
)

# create your groups s1 / s2 and run your test. 
s1 = grainsize_byyear[2009]
s2 = grainsize_byyear[2012]
print(stats.ttest_ind(s1, s2, nan_policy='omit') )
print(stats.ttest_rel(s1, s2, nan_policy='omit') )
Question 19

How many valid observations are available in 2009 and 2012? How many have an observation in each year?

How do these relate to the testing degrees of freedom?

Question 20

Both approaches could be used in this case, but the outcomes are very different. Explain the differences between the tests and which outcome you think is more reliable and/or informative.

Question 21

An alternative way to test a paired sample is to compute the difference between each pair and test whether the mean of those differences deviates significantly from 0. Calculate and run the test in that fashion and compare the outcomes.

Solution

Q19:

# nr of observations for each group
s1.dropna().count() #2569
s2.dropna().count() #2540

# inner merge omits any observations missing from either dataframe
pd.merge(s1.dropna(), s2.dropna(), left_index=True, right_index=True, how='inner').count() #2291

df independent = n(s1) + n(s2) - 2 = 2569 + 2540 - 2 = 5107; df paired = n(s1 & s2) - 1 = 2291 - 1 = 2290

Q20: Grainsize has large spatial differences (high variability in the data) and small changes between years, so independent testing makes little sense. By taking repeat samples at the same locations we can remove the spatial variability and only check how much the change is at matched locations. This strongly reduces the test variance and allows us to detect the changes without them being obscured by the high spatial variance.

Q21:

d12 = s2 - s1
stats.ttest_1samp(d12, 0, nan_policy='omit')

We have now experimented with quite a large number of hypothesis tests for this dataset. Suppose you have to create a report about this analysis but have to keep it fairly brief: which of the test results would you include? Explain your choice.

For data exploration, freely playing with the data and making many plots and analyses is quite OK, but in hypothesis testing (or the statistical modelling we do in later sessions) running many statistical analyses and choosing only a few to report on is a big faux pas in statistics (especially when selecting only the most significant results from many). This is sometimes labeled ‘p-hacking’.

Question 22

Explain the main reasons why selectively reporting statistical results after running them is scientifically not sound.

Solution

Q22: If we make selections after knowing the outcomes, the chosen alpha no longer represents the error rate of our reported results. https://xkcd.com/882/

A 5% chance of rejecting a valid H0 is often considered acceptable, but if we run multiple tests, our chance of hitting (and reporting) a type I error increases a lot.
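This inflation is easy to quantify: under alpha = 0.05 the chance of at least one type I error across 20 independent tests is 1 - 0.95^20 ≈ 0.64. A small simulation sketch (samples drawn from the same population, so H0 is true by construction):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)
alpha, n_tests = 0.05, 20

# many t-tests on pairs of samples from the SAME population:
# every 'significant' result here is a type I error by construction
p_values = [
    stats.ttest_ind(rng.normal(size=50), rng.normal(size=50)).pvalue
    for _ in range(n_tests)
]

# theoretical chance of at least one false positive across the tests
family_wise = 1 - (1 - alpha) ** n_tests
print(f"family-wise error rate: {family_wise:.2f}")  # 0.64
```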

Question 23

The loaded SIBES dataset has many more data fields, both numerical (biomass, abundance) and categorical (tidal basin, tidal flat), plus other years of data if we make a different subset. Think of one scientific question that you find (at least somewhat) interesting to answer with this data, then run and explain the test.

What is your question? Which values/groups do you use? Give the formal statistical hypotheses. What do the outcomes (both accepting and rejecting H0) mean in your own words? What are the test statistics? Explain the result.

Solution

Q23: Everyone has their own results, but the code should run a working test, and you should have provided an interpretation of the results both formally (e.g., H0 rejected, p = 0.0##) and as a real-life interpretation.