Using distributions to analyse your data

Background to this example

Data comes in all sorts and forms within Earth sciences, from long-term paleo records describing oxygen levels in the atmosphere, to time series of river discharge, to spatio-temporal satellite images monitoring vegetation. Within Earth sciences we work with all these types of data to understand the past, present and future of the Earth system. Before we can work with these data we need to understand what we can and cannot do with them, and which conclusions we can and cannot draw.

In this practical you will learn about

  • Simple plots to look at your data

  • How to see which distribution can be used to describe the data

  • Which measures you can use to describe a distribution

  • How to check exceedance of thresholds

  • How to work with small samples and what distributions to use in that case

Getting started

Let’s start with Python again by opening your Conda environment and then opening Spyder (for detailed instructions please look back at the first practical). We start by loading some of the standard libraries in this course. We use:

  • Pandas (data management and data handling)

  • Numpy (numerical computation and data handling)

  • Matplotlib (plotting)

  • Scipy (statistical analysis)

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import scipy.stats as ss

Now we are going to load the datasets, which contain daily temperature, precipitation and evaporation data for the Netherlands. We tell pandas to parse the date information, and use it as row labels:

Tas = pd.read_csv("../Data/dailyTemperature.csv", parse_dates=True, index_col=0).dropna()
Precip = pd.read_csv("../Data/dailyPrecipitation.csv", parse_dates=True, index_col=0).dropna()
Evap = pd.read_csv("../Data/dailyEvaporation.csv", parse_dates=True, index_col=0).dropna()

Normal or not

Let’s start by exploring the different data sources; we want to identify which distributions they follow.

Question 1

Let’s start with a visual inspection. A good way to start is to make a histogram of your data. Use the cheat sheet to explore which function to use to make a histogram of the different meteorological variables.

Solution

Tas.plot(kind="hist")

Precip.plot(kind="hist")

Evap.plot(kind="hist")

As you might have seen, some distributions are unlikely to be normal, but we have another way of testing this: the QQ-plot.

Question 2

Let’s continue with a second visual inspection. Use your knowledge from last week (or look back) to make a QQ plot of the meteorological variables. What is your conclusion based on this analysis? Are the variables normally distributed?

Solution
# Calculate the quantiles of the temperature data
real_quantiles = Tas["Tas"].quantile(q=np.linspace(0.01, 0.99, 100))

# Calculate the theoretical quantiles of a normal distribution
norm_quantiles = ss.norm.ppf(np.linspace(0.01, 0.99, 100))

# Create the Q-Q plot
plt.scatter(norm_quantiles, real_quantiles)
plt.xlabel("Theoretical Quantiles (Normal Distribution)")
plt.ylabel("Real Quantiles")
plt.title("Q-Q Plot")

# Add a line of perfect agreement
plt.plot([norm_quantiles[0], norm_quantiles[-1]], [real_quantiles.iloc[0], real_quantiles.iloc[-1]], color='r')
plt.show()

# Calculate the quantiles of the precipitation data
real_quantiles = Precip["Precip"].quantile(q=np.linspace(0.01, 0.99, 100))

# Calculate the theoretical quantiles of a normal distribution
norm_quantiles = ss.norm.ppf(np.linspace(0.01, 0.99, 100))

# Create the Q-Q plot
plt.scatter(norm_quantiles, real_quantiles)
plt.xlabel("Theoretical Quantiles (Normal Distribution)")
plt.ylabel("Real Quantiles")
plt.title("Q-Q Plot")

# Add a line of perfect agreement
plt.plot([norm_quantiles[0], norm_quantiles[-1]], [real_quantiles.iloc[0], real_quantiles.iloc[-1]], color='r')
plt.show()

# Calculate the quantiles of the evaporation data
real_quantiles = Evap["Evap"].quantile(q=np.linspace(0.01, 0.99, 100))

# Calculate the theoretical quantiles of a normal distribution
norm_quantiles = ss.norm.ppf(np.linspace(0.01, 0.99, 100))

# Create the Q-Q plot
plt.scatter(norm_quantiles, real_quantiles)
plt.xlabel("Theoretical Quantiles (Normal Distribution)")
plt.ylabel("Real Quantiles")
plt.title("Q-Q Plot")

# Add a line of perfect agreement
plt.plot([norm_quantiles[0], norm_quantiles[-1]], [real_quantiles.iloc[0], real_quantiles.iloc[-1]], color='r')
plt.show()

It looks like only the temperature data might follow a normal distribution; both precipitation and evaporation are skewed and do not follow a normal distribution, which is also confirmed by Question 1.
Question 3

Try to implement the Shapiro-Wilk test in Python on the different datasets. You can use the function stats.shapiro, for which you can find the manual here. How do you interpret the return values of the test (p-value and statistic)?

Solution
stats.shapiro(Tas)
/home/runner/.local/lib/python3.12/site-packages/scipy/stats/_axis_nan_policy.py:592: UserWarning: scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 45097.
  res = hypotest_fun_out(*samples, **kwds)
ShapiroResult(statistic=np.float64(0.9915394668801999), pvalue=np.float64(1.947525866153211e-43))
stats.shapiro(Precip)
/home/runner/.local/lib/python3.12/site-packages/scipy/stats/_axis_nan_policy.py:592: UserWarning: scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 43241.
  res = hypotest_fun_out(*samples, **kwds)
ShapiroResult(statistic=np.float64(0.5614171807011161), pvalue=np.float64(7.680685917849244e-134))
stats.shapiro(Evap)
/home/runner/.local/lib/python3.12/site-packages/scipy/stats/_axis_nan_policy.py:592: UserWarning: scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 24462.
  res = hypotest_fun_out(*samples, **kwds)
ShapiroResult(statistic=np.float64(0.8959951840097614), pvalue=np.float64(4.9559581114372047e-82))

Based on these numbers none of the distributions is normal.

Question 4

Based on the different tests (histogram, QQ-plot and the Shapiro-Wilk statistical test), what are your conclusions? They do not have to align across the different tests.

Solution Based on the visual inspections (histogram and QQ-plot) we can conclude that temperature might be normally distributed, but this is not confirmed by the Shapiro test. The big question to reflect on is whether you would nonetheless assume a normal distribution for the temperature or use a stricter selection.

We can also derive two main statistical properties for the temperature dataset, namely the mean and standard deviation.

Exploring the temperature data

Question 5

Obtain the mean and standard deviation for the temperature record.

Solution
Tas.describe()
Tas
count 45097.000000
mean 9.582449
std 6.288493
min -14.900000
25% 5.100000
50% 9.800000
75% 14.500000
max 29.700000

Or

Tas.mean()
Tas    9.582449
dtype: float64
Tas.std()
Tas    6.288493
dtype: float64

For simplicity, we now assume our data is normally distributed. Now that we have the mean and standard deviation we can calculate the probability of the daily temperature exceeding certain thresholds. You can use the stats.norm.cdf() function for this, providing the mean and standard deviation you found earlier. Make sure you understand what the number returned by cdf (cumulative distribution function) represents.
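To make concrete what cdf returns, here is a minimal sketch with made-up numbers (a mean of 10 and a standard deviation of 5 are assumptions for illustration): cdf(x, mean, sd) gives the probability of a value at or below x, so exceedance probabilities come from its complement.

```python
import scipy.stats as stats

# Hypothetical numbers for illustration: mean 10, standard deviation 5
p_below_mean = stats.norm.cdf(10, 10, 5)    # P(X <= 10): half the probability mass
p_above_20 = 1 - stats.norm.cdf(20, 10, 5)  # P(X >= 20): the complement of the CDF

print(p_below_mean)
print(p_above_20)
```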

Question 6

What is the chance of having a daily temperature above or equal to 20C? Use the CDF function (as described above) and also estimate the fraction based on the data (use an empirical function like value_counts, count or something else). Is there a difference, and if so, what causes this difference?

Solution
warm = Tas[Tas >= 20]
warm.value_counts()
Tas 
20.0    66
20.5    59
20.1    57
20.6    55
20.2    55
        ..
27.7     1
29.7     1
26.7     1
26.1     1
27.0     1
Name: count, Length: 73, dtype: int64

Or

warm = Tas[Tas >= 20]
warm
warm.describe()
warm.count()
Tas    1436
dtype: int64

Using the distribution

tasMean = Tas.mean()
tasSD = Tas.std()
1- stats.norm.cdf(20, tasMean, tasSD)
array([0.04879965])

If all is well you found that on roughly 5% of days we have temperatures above 20C in the Netherlands. If this number seems low to you, keep in mind that we are talking about daily average temperatures here, not the maximum temperatures that are often reported in the news.

A question we often hear in the media: is climate change already happening and is the world really getting warmer? Let’s test that for the Netherlands.

Question 7

Select the last 20 years of the record and compare those to the entire record that we have. First select the years, calculate the new mean and standard deviation, and then see if we are more likely to observe temperatures above 20C compared to the full record from 1901 onwards. Only use the fitted normal distribution for this analysis.

Solution
tasMeanNew = Tas.loc["2004":"2023"].mean()
tasSDNew = Tas.loc["2004":"2023"].std()
1- stats.norm.cdf(20, tasMeanNew, tasSDNew)
array([0.07005082])

If you did it well you find that the probability of Tas >= 20C has increased from 5% to 7%. So far we have done this mostly with the statistical approximation of the data (i.e. a fitted normal distribution), but we also have the real data available. As a tip, look at the starting Pandas practical for how to test an entire dataframe for a certain condition.

Question 8

Count the number of days with temperatures >= 20C for the entire record and for the last 20 years, and convert these into fractions.

Solution
warm = Tas[Tas >= 20]
warmCount = warm.count()
print(warmCount)
print(warmCount/len(Tas))
Tas    1436
dtype: int64
Tas    0.031842
dtype: float64
warm = Tas[Tas >= 20].loc["2004":"2023"]
warmCount = warm.count()
print(warmCount)
print(warmCount/len(Tas.loc["2004":"2023"]))
Tas    420
dtype: int64
Tas    0.057495
dtype: float64

If you did it well you find that the total record has 1436 days with temperatures >= 20C out of 45097 days, which is 3.2% of the days. For the last 20 years we have 420 days >= 20C out of a record length of 7305, which is 5.7% of the days. As you can see these numbers don’t match the ones you found for Question 7.

Question 9

Using the QQ plot you made earlier, can you see a reason why the normal distribution is not doing a good job for this assessment, especially for the high values? Would you expect the answers for Questions 7 and 8 to be closer when looking at <= 0C?

Solution There is a smaller deviation between the observations and the theoretical 1:1 line, so we would expect the values <= 0C to follow the normal distribution better.
Question 10

Repeat Questions 6, 7 and 8 but now for <= 0C. Was your initial estimate correct?

Solution
cold = Tas[Tas <= 0]
coldCount = cold.count()
coldCount/len(Tas)
Tas    0.066612
dtype: float64

Using the distribution

tasMean = Tas.mean()
tasSD = Tas.std()
stats.norm.cdf(0, tasMean, tasSD)
array([0.06377848])

Or for the last 20 years

cold = Tas[Tas <= 0].loc["2004":"2023"]
coldCount = cold.count()
coldCount/len(Tas.loc["2004":"2023"])
Tas    0.042026
dtype: float64

Using the distribution

tasMean = Tas.loc["2004":"2023"].mean()
tasSD = Tas.loc["2004":"2023"].std()
stats.norm.cdf(0, tasMean, tasSD)
array([0.03955379])

If all went well you find that the chance of <= 0C temperatures is around 6.5% for the entire data record (depending on the method used) and 4.2% for the last 20 years.

Question 11

Can you conclude from your assessment of Tas <= 0C that temperatures are rising and that ice skating is becoming less and less likely? Or would you like to propose additional calculations or data requirements?

Solution For some proper ice skating we also need to include the temporal probability: we need multiple days in a row with temperatures below 0C. So not only do we need to check the unconditional probability, but also the chance of having multiple freezing days in a row.
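A sketch of how such cold-spell lengths could be counted, using a small hypothetical temperature series (the run-labelling trick with shift and cumsum is one common pandas approach, not the only one):

```python
import pandas as pd

# Hypothetical daily temperatures; in the practical you would use the Tas data
temp = pd.Series([2.0, -1.0, -2.0, -0.5, 3.0, -1.0, 1.0, -2.0, -3.0, -1.0, -0.5, 4.0])

freezing = temp <= 0
# Each time `freezing` flips, the cumulative sum increases, labelling every run
run_id = (freezing != freezing.shift()).cumsum()
# Summing booleans per run gives the run length; keep only the freezing runs
spells = freezing.groupby(run_id).sum()
cold_spells = spells[spells > 0]

print(cold_spells.tolist())  # lengths of the consecutive freezing spells
print(cold_spells.max())     # the longest freezing spell
```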

Using the same weather data the KNMI did an in-depth assessment; you can compare your answers to what they found and see if your findings are in line.

Exploring precipitation data

As you could already see from Questions 2, 3 and 4, the precipitation data is certainly not normally distributed; it actually follows a different type of distribution. You have learned that a distribution has multiple moments.

Question 12

Obtain the first three moments for the daily precipitation data and don’t forget to use the dropna() function to ensure that you can calculate them. You will have to rely on the Scipy functions for the higher-order moments. Tip: you can use the separate functions for the moments or a function that directly describes a set of moments.

Solution

The first three moments of a distribution are the mean, variance and skewness. You can calculate them using the code below.

Precip = pd.read_csv("../Data/dailyPrecipitation.csv", parse_dates=True, index_col=0).dropna()
np.mean(Precip)
np.float64(2.202092921070281)
np.var(Precip)
np.float64(19.26110637105162)
stats.skew(Precip)
array([3.66369029])

Alternatively you could use the describe function from Scipy.

stats.describe(Precip)
DescribeResult(nobs=np.int64(43241), minmax=(array([0.]), array([63.9])), mean=array([2.20209292]), variance=array([19.26155182]), skewness=array([3.66369029]), kurtosis=array([20.57153234]))

You find that the precipitation data violates the normal distribution properties and is highly skewed. As a result, precipitation often follows a gamma distribution, which is bounded by zero and has a long tail. In Python you can fit any kind of distribution to your data and check whether it makes sense. Below you find examples of fitting both a normal and a gamma distribution.

mean, std = stats.norm.fit(Precip)
randomData = stats.norm.rvs(mean, std, size= len(Precip))
plt.hist(Precip, alpha=0.5)
plt.hist(randomData, alpha=0.5)
plt.show()

alpha, loc, beta = stats.gamma.fit(Precip)
randomData = stats.gamma.rvs(alpha, loc, beta, size= len(Precip))
plt.hist(Precip, alpha=0.5)
plt.hist(randomData, alpha=0.5)
plt.show()

You can see straight away that the normal distribution is not working for this example and that the gamma distribution is better suited for this purpose. We will not explore this specific gamma distribution further, but you now know how to obtain the parameters for any distribution in Python. We will, however, further explore the annual maximum precipitation event.

Question 13

Start by resampling the precipitation data to obtain the annual maximum value using the resample function. After that, make a histogram of the data and describe what you see.

Solution
annualMaxPrecip = Precip.resample("YE").max()
### Or for older Python versions
#annualMaxPrecip = Precip.resample("Y").max()
annualMaxPrecip.plot(kind="hist")

The annual maximum precipitation is not normally distributed but right-skewed: most values are low, and there are some extreme outliers as well.
Question 14

Does the annual maximum precipitation reflect a normal distribution or not?

Solution
# Calculate the quantiles of the precipitation data
real_quantiles = annualMaxPrecip["Precip"].quantile(q=np.linspace(0.01, 0.99, 100))

# Calculate the theoretical quantiles of a normal distribution
norm_quantiles = ss.norm.ppf(np.linspace(0.01, 0.99, 100))

# Create the Q-Q plot
plt.scatter(norm_quantiles, real_quantiles)
plt.xlabel("Theoretical Quantiles (Normal Distribution)")
plt.ylabel("Real Quantiles")
plt.title("Q-Q Plot")

# Add a line of perfect agreement
plt.plot([norm_quantiles[0], norm_quantiles[-1]], [real_quantiles.iloc[0], real_quantiles.iloc[-1]], color='r')
plt.show()

stats.shapiro(annualMaxPrecip)
ShapiroResult(statistic=np.float64(0.8985685894648064), pvalue=np.float64(1.825672009377136e-07))
Based on these results we conclude that the data is not normally distributed.
Question 15

Now apply a logarithmic transform to the data and make a histogram plot. Does it now reflect a normal distribution?

Solution
annualMaxPrecip["logAnnualMaxPrecip"] = np.log10(annualMaxPrecip["Precip"])
annualMaxPrecip["logAnnualMaxPrecip"].plot(kind="hist")

Or more directly in the plotting function (note that logx only changes the axis scale; the bins themselves stay linear):

annualMaxPrecip["Precip"].plot(kind="hist", logx=True)

If all is well you find that the log-transformed annual maximum precipitation does approach a normal distribution. The advantage of this is that we can now use all the tools and tests we have available for the normal distribution to test different hypotheses on the data.
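A minimal sketch of this idea with synthetic data (the lognormal sample below is an assumption standing in for the annual maxima, not the real record): after a log-transform the skew largely disappears, so normal-distribution tools like the Shapiro-Wilk test become applicable.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)
# Synthetic right-skewed sample (lognormal), standing in for the annual maxima
skewed = rng.lognormal(mean=3.0, sigma=0.4, size=120)
logged = np.log10(skewed)

# The log-transform removes most of the skew...
print(stats.skew(skewed), stats.skew(logged))
# ...so a normality test on the logged data now makes sense
print(stats.shapiro(logged))
```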

Exploring binomial and Poisson distributions

We can also ask ourselves how often “x” happens, where “x” is a certain event or extreme. Let’s go back to the temperature data and the ice skating example. In the Netherlands people are always concerned about whether we can have an Elfstedentocht, or when we can go skating in general. Use the variable you created for Question 11, which should be a dataframe with True and False values from 1901 onwards. We have already concluded that there is a 6.5% chance of a day with Tas <= 0C. Let’s assume that this value is the true chance of having a day <= 0C. We can start drawing random samples from the full dataset.

## To ensure we get the same random samples and thus the same answers.
np.random.seed(10)
sampleSize = 100
sampleData = Tas.sample(sampleSize)
fraction = sampleData[sampleData <= 0].count()/sampleSize
print(fraction)
print(fraction)
Tas    0.09
dtype: float64

You find that the fraction is higher than the expected value of 6.5% (0.065). This is because our sample size is insufficient to correctly reflect the full observational record. Again this confirms that sample size is very relevant in your work; getting enough data is key in the real world, and in the Geosciences as well. We will explore how many samples we need using a for-loop.

## To ensure we get the same random samples and thus the same answers.
np.random.seed(10)
## Sample size
samplePoints = np.arange(5,40000,20)
## Empty array for the output
fraction = np.zeros(len(samplePoints))

## For-loop over the sample sizes, with enumerate providing the matching index
for i, s in enumerate(samplePoints):
  sampleData = Tas.sample(s)
  fraction[i] = sampleData[sampleData <= 0].count().iloc[0]/s
  
## Plotting the output with log x-axis
plt.plot(samplePoints, fraction, "-")
plt.xscale('log')
plt.show()

You will observe that you need a rather large sample size to obtain the “true” value of 6.5%, and that the estimate is still converging even with 40000 samples.
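The slow convergence can be sketched with the standard error of a sample fraction, sqrt(p(1-p)/n); assuming p = 0.065 is the “true” chance of a freezing day:

```python
import numpy as np

p = 0.065  # assumed "true" chance of a freezing day
for n in [100, 1000, 10000, 40000]:
    se = np.sqrt(p * (1 - p) / n)  # standard error of the sample fraction
    print(n, round(se, 4))
```

At n = 100 the standard error (about 2.5 percentage points) is a large fraction of the 6.5% being estimated, which matches the noisy fractions you see for small samples.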

Question 16

Do you think it will take more or fewer samples to obtain the “true” exceedance of a value if you take a threshold with a higher probability, for example 10C? Test this with the script above and provide the figure.

Solution
## To ensure we get the same random samples and thus the same answers.
np.random.seed(10)
## Sample size
samplePoints = np.arange(5,40000,20)
## Empty array for the output
fraction = np.zeros(len(samplePoints))

## For-loop over the sample sizes, with enumerate providing the matching index
for i, s in enumerate(samplePoints):
  sampleData = Tas.sample(s)
  fraction[i] = sampleData[sampleData <= 10].count().iloc[0]/s
  
## Plotting the output with log x-axis
plt.plot(samplePoints, fraction, "-")
plt.xscale('log')
plt.show()

We can also ask ourselves how many days in the coming year will have temperatures <= 0C. For this we can use the Poisson distribution to generate likely distributions of freezing days per year. We know from the previous analysis that we have 3004 days with Tas <= 0C in a total record length of 45097 days, which means we have on average 3004/45097*365 = 24.31 days with Tas <= 0C per year since 1901. Now let’s plot a Poisson distribution.

np.random.seed(10)
avgDays = Tas[Tas <= 0].count().iloc[0]/len(Tas)*365
print(avgDays)
randomYears = stats.poisson.rvs(avgDays, size=1000)
plt.hist(randomYears)
24.313368960241256
(array([  5.,  26.,  86., 194., 223., 278., 127.,  45.,  12.,   4.]),
 array([ 9. , 12.2, 15.4, 18.6, 21.8, 25. , 28.2, 31.4, 34.6, 37.8, 41. ]),
 <BarContainer object of 10 artists>)

Or if you want to directly calculate the probabilities you can do the following:

x = np.arange(5,45)
y = stats.poisson.pmf(x, avgDays)
plt.plot(x, y, '-')

Question 17

We can also use the Poisson distribution to ask questions like: what is the likelihood of having at most 15 freezing days? Use stats.poisson.cdf(k, mu) to find the chance of having 15 or fewer freezing days in a given year.

Solution
avgDays = Tas[Tas <= 0].count().iloc[0]/len(Tas)*365
stats.poisson.cdf(15, avgDays)
np.float64(0.03009273491856656)
This happens in about 3% of years.
Question 18

If you have paid careful attention you might have noticed that in this case the Poisson distribution approaches a normal distribution. This means that the Central Limit Theorem comes into play again. Now lower the mean of 24.31 to something like 3 days per year. Does this still approach a normal distribution, and if not, why not?

Solution
np.random.seed(10)
avgDays = 3
print(avgDays)
randomYears = stats.poisson.rvs(avgDays, size=1000)
plt.hist(randomYears)
3
(array([ 59., 164., 229., 208., 158., 104.,  49.,  17.,   9.,   3.]),
 array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.]),
 <BarContainer object of 10 artists>)

Or if you want to directly calculate the probabilities you can do the following:

x = np.arange(0,8)
y = stats.poisson.pmf(x, avgDays)
plt.plot(x, y, '-')

We see now that the distribution no longer approaches a normal distribution.
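One way to check this numerically (a quick sketch, not part of the practical’s required code) is to compare the Poisson pmf with a normal pdf that has the same mean and variance: the gap shrinks as the mean grows.

```python
import numpy as np
import scipy.stats as stats

def max_gap(mu):
    """Largest difference between the Poisson pmf and a matched normal pdf."""
    k = np.arange(0, int(mu) + 40)
    return np.abs(stats.poisson.pmf(k, mu) - stats.norm.pdf(k, mu, np.sqrt(mu))).max()

# The normal approximation is much better for a large mean than for a small one
print(max_gap(24.31), max_gap(3))
```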

Using a t-distribution

When we work with problems that have small sample sizes, for example field observations where the sample might range from a handful of measurements up to a hundred, the true variance of the mean is unknown and must be approximated by the sample standard error of the mean. The ratio between the sample mean and the standard error has a distribution that is called the t-distribution, which converges towards the normal distribution for larger sample sizes.

We can now use both the normal and the t-distribution on the temperature dataset. We first fit both a t-distribution on a limited sample and a normal distribution, and then plot the probability density function (PDF) for both distributions.

sampleTas = Tas.sample(15, random_state=1)
tStat, loc, scale = stats.t.fit(sampleTas)
degreeF = len(sampleTas) - 2
mean, std = stats.norm.fit(sampleTas)

xLocs = np.arange(Tas.min().iloc[0], Tas.max().iloc[0], 0.05)

plt.plot(xLocs, stats.t.pdf(xLocs, degreeF, loc, scale), "-", color = "Red", label="t-dist")
plt.plot(xLocs, stats.norm.pdf(xLocs, mean, std), "-", color = "Blue", label="normal-dist")
plt.legend()
plt.show()

And again for the full sample of data.

tStat, loc, scale = stats.t.fit(Tas)
mean, std = stats.norm.fit(Tas)
degreeF = len(Tas) - 2

xLocs = np.arange(Tas.min().iloc[0], Tas.max().iloc[0], 0.05)

plt.plot(xLocs, stats.t.pdf(xLocs, degreeF, loc, scale), "-", color = "Red", label="t-dist")
plt.plot(xLocs, stats.norm.pdf(xLocs, mean, std), "-", color = "Blue", label="normal-dist")
plt.legend()
plt.show()

Question 19

Do you think there is a significant difference between the t-distribution and the normal distribution for small samples? And for the large sample?

Solution For smaller samples we do see the heavier tails; however, if we provide more data points, we see that the t-distribution slowly approaches the normal distribution, as the theory tells us.

Now we will run the same experiment on a smaller data sample. First we create the subset of the temperature data.

np.random.seed(10)
sampleSize = 100
sampleData = Tas.sample(sampleSize)
Question 20

What happens if you use a smaller subset of the temperature data: does the t-distribution still approximate the normal distribution of the full dataset? Adjust the code above to show the Cumulative Density Function (CDF) of the sampleData and describe your interpretation of the CDF.

Solution
Tas.hist(cumulative=True, density=True, histtype='step',label="Full set")
array([[<Axes: title={'center': 'Tas'}>]], dtype=object)

sampleSize = 100
sampleData = Tas.sample(sampleSize)
sampleData.hist(cumulative=True, density=True, histtype='step',label="sample set")
array([[<Axes: title={'center': 'Tas'}>]], dtype=object)

Or if you want to make it into one figure:

plt.hist(Tas, cumulative=True, density=True, histtype='step',label="Full set")
sampleSize = 100
sampleData = Tas.sample(sampleSize)
plt.hist(sampleData, cumulative=True, density=True, histtype='step',label="sample set")
plt.legend()
plt.show()

Do your own data exploration

Q = pd.read_csv("../Data/dailyDischarge.csv", parse_dates=True, index_col=0).dropna()
Question 21

You have loaded the daily discharge data for the Rhine river; now start exploring. You should produce the following:

  • Some plots that allow you to understand what the data looks like

  • Describe some relevant statistical values

  • Describe if the data is normal or not and why

  • Answer the questions:

    • How often do we get floods above 8000 m3s-1?

    • How often do we go below 1000 m3s-1?

  • Calculate the observed exceedance for flood and drought per year

    • Tip: you can calculate the annual occurrence by comparing the values to your threshold

    • Then calculate the number of times you actually have an event

    • Flood: Q[Q>8000].resample("YE").count()

    • Drought: Q[Q<1000].resample("YE").count()

    • Then plot the histogram of both selections to look at the distribution (also needed for the next sub-question)

  • Create a Poisson distribution for both flood and drought

  • Can you hypothesise why the Poisson distribution doesn’t work for the drought example?

Solution

Some relevant plots for a timeseries and for the histogram

Q.plot()

Q.plot(kind="hist")

Computing the moments of this distribution using Scipy

stats.describe(Q)
DescribeResult(nobs=np.int64(44771), minmax=(array([575.]), array([13000.])), mean=array([2200.07870753]), variance=array([1270596.22239363]), skewness=array([2.10692705]), kurtosis=array([6.9744782]))

Make a QQ plot and test with Shapiro

# Calculate the quantiles of the discharge data
real_quantiles = Q["Q"].quantile(q=np.linspace(0.01, 0.99, 100))

# Calculate the theoretical quantiles of a normal distribution
norm_quantiles = ss.norm.ppf(np.linspace(0.01, 0.99, 100))

# Create the Q-Q plot
plt.scatter(norm_quantiles, real_quantiles)
plt.xlabel("Theoretical Quantiles (Normal Distribution)")
plt.ylabel("Real Quantiles")
plt.title("Q-Q Plot")

# Add a line of perfect agreement
plt.plot([norm_quantiles[0], norm_quantiles[-1]], [real_quantiles.iloc[0], real_quantiles.iloc[-1]], color='r')
plt.show()

stats.shapiro(Q)
/home/runner/.local/lib/python3.12/site-packages/scipy/stats/_axis_nan_policy.py:592: UserWarning: scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 44771.
  res = hypotest_fun_out(*samples, **kwds)
ShapiroResult(statistic=np.float64(0.8310783131039213), pvalue=np.float64(5.946077911910897e-108))

Floods above 8000 m3s-1 over the entire record

flood = Q[Q >= 8000].count().iloc[0]
print(flood)
122

Droughts below 1000 m3s-1 over the entire record

drought = Q[Q <= 1000].count().iloc[0]
print(drought)
2199

First we determine the average number of flood days per year and then plot the observed distribution

floodDays = flood/len(Q)*365
print(floodDays)
floodYears = Q[Q>8000].resample("YE").count()
plt.hist(floodYears)
0.994617051216189
(array([99.,  2.,  7.,  7.,  2.,  1.,  4.,  0.,  0.,  1.]),
 array([ 0. ,  1.3,  2.6,  3.9,  5.2,  6.5,  7.8,  9.1, 10.4, 11.7, 13. ]),
 <BarContainer object of 10 artists>)

Now the same for the droughts

droughtDays = drought/len(Q)*365
print(droughtDays)
droughtYears = Q[Q<1000].resample("YE").count()
plt.hist(droughtYears)
17.927564718232784
(array([92., 12.,  8.,  3.,  2.,  4.,  0.,  1.,  0.,  1.]),
 array([  0. ,  20.2,  40.4,  60.6,  80.8, 101. , 121.2, 141.4, 161.6,
        181.8, 202. ]),
 <BarContainer object of 10 artists>)

Two methods again for the flood Poisson distribution

randomYears = stats.poisson.rvs(floodDays, size=1000)
plt.hist(randomYears)
(array([365., 384.,   0., 173.,   0.,  54.,  17.,   0.,   6.,   1.]),
 array([0. , 0.6, 1.2, 1.8, 2.4, 3. , 3.6, 4.2, 4.8, 5.4, 6. ]),
 <BarContainer object of 10 artists>)

x = np.arange(0,5)
y = stats.poisson.pmf(x, floodDays)
plt.plot(x, y, '-')

Two methods again for the drought Poisson distribution

randomYears = stats.poisson.rvs(droughtDays, size=1000)
plt.hist(randomYears)
(array([  7.,  54., 149., 370., 224., 141.,  47.,   5.,   2.,   1.]),
 array([ 5. ,  8.3, 11.6, 14.9, 18.2, 21.5, 24.8, 28.1, 31.4, 34.7, 38. ]),
 <BarContainer object of 10 artists>)

x = np.arange(0,40)
y = stats.poisson.pmf(x, droughtDays)
plt.plot(x, y, '-')

For floods the histogram matches the shape of the Poisson distribution, while for droughts the histogram doesn’t match the Poisson distribution at all. This is most likely due to strong temporal correlation in the drought events, which violates the assumption of no dependency between events. Flood events have a shorter temporal duration and thus a smaller autocorrelation. In other words, for droughts we are violating one of the critical assumptions of the Poisson distribution.
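A hedged illustration of that last point with synthetic data (the probabilities and spell length below are made up): when events arrive in multi-day spells, the variance of the yearly counts grows well beyond the mean, while a Poisson variable has variance roughly equal to its mean.

```python
import numpy as np

rng = np.random.default_rng(10)

# Independent events: every day is hit with probability p, independently
p = 0.05
independent = rng.random((100, 365)) < p   # 100 synthetic "years"
counts_ind = independent.sum(axis=1)

# Clustered events: same daily probability, but a hit fills a whole week
spells = rng.random((100, 52)) < p
clustered = np.repeat(spells, 7, axis=1)
counts_clu = clustered.sum(axis=1)

# Variance/mean ratio: ~1 for Poisson-like counts, much larger when clustered
print(counts_ind.var() / counts_ind.mean())
print(counts_clu.var() / counts_clu.mean())
```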

What you have learned today

If all is well you have learned today:

  • Simple plots to look at your data

  • How to see which distribution can be used to describe the data

  • Which measures you can use to describe a distribution

  • How to check exceedance of thresholds

  • How to work with small samples and what distributions to use in that case