import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import scipy.stats as ss
Using distributions to analyse your data
Background to this example
Data comes in all sorts and forms within the Earth sciences, from long-term paleo records describing oxygen levels in the atmosphere, to time series of river discharge, to spatio-temporal satellite images monitoring vegetation. Within the Earth sciences we work with all these types of data to understand the past, present and future of the Earth system. Before we can work with these types of data we need to understand what we can and cannot do with the data, and which conclusions we can and cannot draw.
In this practical you will learn about
Simple plots to look at your data
How to see which distribution can be used to describe the data
Which measures you can use to describe a distribution
How to check exceedance of thresholds
How to work with small samples and what distributions to use in that case
Getting started
Let’s start with using Python again by opening your Conda environment and then opening Spyder (for detailed instructions please look back at the first practical). We start by loading some of the standard libraries used in this course. We use:
Pandas (data management and data handling)
Numpy (numerical computation and data handling)
Matplotlib (plotting)
Scipy (statistical analysis)
Now we are going to take a look at the datasets, which contain daily temperature, precipitation and evaporation data for the Netherlands. We tell pandas to parse the date information, and use it as row labels:
Tas = pd.read_csv("../Data/dailyTemperature.csv", parse_dates=True, index_col=0).dropna()
Precip = pd.read_csv("../Data/dailyPrecipitation.csv", parse_dates=True, index_col=0).dropna()
Evap = pd.read_csv("../Data/dailyEvaporation.csv", parse_dates=True, index_col=0).dropna()
Normal or not
Let’s start with exploring the different data sources; we want to identify which distributions they follow.
A good first step is a visual inspection: make a histogram of your data. Use the cheat sheet to explore which function to use to make histograms of the different meteorological variables.
Solution
Tas.plot(kind="hist")
Precip.plot(kind="hist")
Evap.plot(kind="hist")
As you might have seen, some distributions are unlikely to be normal, but we have another way of testing this: the QQ-plot.
Let’s continue with a second visual inspection. Use your knowledge from last week (or look back) to make a QQ plot of the meteorological variables. What is your conclusion based on this analysis? Are the variables normally distributed?
Solution
# Calculate the quantiles of the temperature data
real_quantiles = Tas["Tas"].quantile(q=np.linspace(0.01, 0.99, 100))
# Calculate the theoretical quantiles of a normal distribution
norm_quantiles = ss.norm.ppf(np.linspace(0.01, 0.99, 100))
# Create the Q-Q plot
plt.scatter(norm_quantiles, real_quantiles)
plt.xlabel("Theoretical Quantiles (Normal Distribution)")
plt.ylabel("Real Quantiles")
plt.title("Q-Q Plot")
# Add a line of perfect agreement
plt.plot([norm_quantiles[0], norm_quantiles[-1]], [real_quantiles.iloc[0], real_quantiles.iloc[-1]], color='r')
plt.show()
# Calculate the quantiles of the precipitation data
real_quantiles = Precip["Precip"].quantile(q=np.linspace(0.01, 0.99, 100))
# Calculate the theoretical quantiles of a normal distribution
norm_quantiles = ss.norm.ppf(np.linspace(0.01, 0.99, 100))
# Create the Q-Q plot
plt.scatter(norm_quantiles, real_quantiles)
plt.xlabel("Theoretical Quantiles (Normal Distribution)")
plt.ylabel("Real Quantiles")
plt.title("Q-Q Plot")
# Add a line of perfect agreement
plt.plot([norm_quantiles[0], norm_quantiles[-1]], [real_quantiles.iloc[0], real_quantiles.iloc[-1]], color='r')
plt.show()
# Calculate the quantiles of the evaporation data
real_quantiles = Evap["Evap"].quantile(q=np.linspace(0.01, 0.99, 100))
# Calculate the theoretical quantiles of a normal distribution
norm_quantiles = ss.norm.ppf(np.linspace(0.01, 0.99, 100))
# Create the Q-Q plot
plt.scatter(norm_quantiles, real_quantiles)
plt.xlabel("Theoretical Quantiles (Normal Distribution)")
plt.ylabel("Real Quantiles")
plt.title("Q-Q Plot")
# Add a line of perfect agreement
plt.plot([norm_quantiles[0], norm_quantiles[-1]], [real_quantiles.iloc[0], real_quantiles.iloc[-1]], color='r')
plt.show()
Try to implement the Shapiro-Wilk test in Python on the different datasets. You can use the function stats.shapiro, for which you find the manual here. How do you interpret the return values of the test (p-value and statistic)?
Solution
stats.shapiro(Tas)
UserWarning: scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 45097.
ShapiroResult(statistic=np.float64(0.9915394668801999), pvalue=np.float64(1.947525866153211e-43))
stats.shapiro(Precip)
UserWarning: scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 43241.
ShapiroResult(statistic=np.float64(0.5614171807011161), pvalue=np.float64(7.680685917849244e-134))
stats.shapiro(Evap)
UserWarning: scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 24462.
ShapiroResult(statistic=np.float64(0.8959951840097614), pvalue=np.float64(4.9559581114372047e-82))
Based on these numbers none of the distributions is normal.
Based on the different tests (histogram, QQ-plot and the statistical Shapiro-Wilk test), what are your conclusions? They do not have to align across the different tests.
Solution
Based on the visual inspections (histogram and QQ-plot) we can conclude that temperature might be normally distributed, but this is not confirmed by the Shapiro-Wilk test. The big question to reflect on is whether you would nonetheless assume a normal distribution for the temperature, or use a stricter selection.
We can also derive two main statistical properties for the temperature dataset, namely the mean and standard deviation.
Exploring the temperature data
Obtain the mean and standard deviation for the temperature record.
Solution
Tas.describe()
|  | Tas |
|---|---|
| count | 45097.000000 |
| mean | 9.582449 |
| std | 6.288493 |
| min | -14.900000 |
| 25% | 5.100000 |
| 50% | 9.800000 |
| 75% | 14.500000 |
| max | 29.700000 |
Or
Tas.mean()
Tas    9.582449
dtype: float64
Tas.std()
Tas    6.288493
dtype: float64
For simplicity, we now assume our data is normally distributed. Now that we have the mean and standard deviation, we can calculate the probability of the daily temperature exceeding certain thresholds. You can use the stats.norm.cdf() function for this, providing the mean and standard deviation you found earlier. Make sure you understand what the number returned by cdf (cumulative distribution function) represents.
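As a minimal sketch of how stats.norm.cdf works (the rounded mean and standard deviation below are assumed values, close to what Tas.describe() reports):

```python
import scipy.stats as stats

# Assumed (rounded) mean and standard deviation of the daily temperature
mean, std = 9.58, 6.29

# The CDF gives P(T <= 20), the probability of being at or below the threshold
p_below = stats.norm.cdf(20, loc=mean, scale=std)

# The exceedance probability P(T >= 20) is the complement
p_above = 1 - p_below
print(round(p_above, 3))  # roughly 0.05, i.e. about 5% of the days
```

For "at or below" questions, such as temperatures <= 0C, use the CDF value itself instead of its complement.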
What is the chance of having a daily temperature above or equal to 20C? Use the CDF function (as described above) and also estimate the fraction directly from the data (please use an empirical function like value_counts, count or something else). Is there a difference, and if so, what causes this difference?
Solution
warm = Tas[Tas >= 20]
warm.value_counts()
Tas
20.0 66
20.5 59
20.1 57
20.6 55
20.2 55
..
27.7 1
29.7 1
26.7 1
26.1 1
27.0 1
Name: count, Length: 73, dtype: int64
Or
warm = Tas[Tas >= 20]
warm
warm.describe()
warm.count()
Tas    1436
dtype: int64
Using the distribution
tasMean = Tas.mean()
tasSD = Tas.std()
1 - stats.norm.cdf(20, tasMean, tasSD)
array([0.04879965])
If all is well you found that on roughly 5% of the days we have temperatures above 20C in the Netherlands. If this number seems low to you, keep in mind that we are talking about daily average temperatures here, not the maximum temperatures that are often reported in the news.
A question we often hear in the media: is climate change already happening and is the world really getting warmer? Let’s test that for the Netherlands.
Select the last 20 years of the record and compare those to the entire record that we have. First select the years, calculate the new mean and standard deviation, and then see if we are more likely to observe temperatures above 20C compared to the full record from 1901 onwards. Only use the fitted normal distribution for this analysis.
Solution
tasMeanNew = Tas.loc["2004":"2023"].mean()
tasSDNew = Tas.loc["2004":"2023"].std()
1 - stats.norm.cdf(20, tasMeanNew, tasSDNew)
array([0.07005082])
If you did it well you find that the probability of Tas >= 20C has increased from 5% to 7%. So far we have done this mostly with a statistical approximation of the data (i.e. a fitted normal distribution), but we also have the real data available. As a tip, look at the startingPandas practical on how to test an entire dataframe for a certain condition.
Count the number of days that exceed or equal >= 20C for the entire record and the last 20 years and convert these into fractions.
Solution
warm = Tas[Tas >= 20]
warmCount = warm.count()
print(warmCount)
print(warmCount/len(Tas))
Tas    1436
dtype: int64
Tas    0.031842
dtype: float64
warm = Tas[Tas >= 20].loc["2004":"2023"]
warmCount = warm.count()
print(warmCount)
print(warmCount/len(Tas.loc["2004":"2023"]))
Tas    420
dtype: int64
Tas    0.057495
dtype: float64
If you did it well you find that the total record has 1436 days with temperatures >= 20C out of 45097 days in total, which is 3.2% of the days. For the last 20 years we have 420 days >= 20C out of a record length of 7305, which is 5.7% of the days. As you can see these numbers don’t match the ones you found for Question 7.
Using the QQ plot you made earlier, can you see a reason why the normal distribution is not doing a good job for this assessment, especially for the high values? Would you expect that the answers for Questions 7 and 8 are closer when looking at <= 0C?
Solution
There is a smaller deviation between the observations and the theoretical 1:1 line, so we would expect the values <= 0C to follow the normal distribution better.
Repeat Questions 6, 7 and 8, but now for <= 0C. Was your initial estimate correct?
Solution
cold = Tas[Tas <= 0]
coldCount = cold.count()
coldCount/len(Tas)
Tas    0.066612
dtype: float64
Using the distribution
tasMean = Tas.mean()
tasSD = Tas.std()
stats.norm.cdf(0, tasMean, tasSD)
array([0.06377848])
Or for the last 20 years
cold = Tas[Tas <= 0].loc["2004":"2023"]
coldCount = cold.count()
coldCount/len(Tas.loc["2004":"2023"])
Tas    0.042026
dtype: float64
Using the distribution
tasMean = Tas.loc["2004":"2023"].mean()
tasSD = Tas.loc["2004":"2023"].std()
stats.norm.cdf(0, tasMean, tasSD)
array([0.03955379])
If all went well you find that the chance of <= 0C temperatures is around 6.5% for the entire data record (depending on the method used) and 4.2% for the last 20 years.
Can you conclude from your assessment of Tas <= 0C that temperatures are rising and that ice skating becomes less and less likely? Or would you like to propose additional calculations or data requirements?
Solution
For some proper ice skating we also need to include the temporal probability: we need multiple days in a row with temperatures below 0C. So not only do we need to check the unconditional probability, but also the chance of having multiple freezing days in a row.
Using the same weather data the KNMI did an in-depth assessment; you can compare your answers to what they found and see if your findings are in line.
Exploring precipitation data
As you could already see from Questions 2, 3 and 4, the precipitation data is certainly not normally distributed; it actually follows a different type of distribution. You have learned that a distribution has multiple moments.
Obtain the first three moments for the daily precipitation data, and don’t forget to use the dropna() function to ensure that you can calculate them. You will have to rely on the Scipy functions for the higher order moments. Tip: you can use the separate functions for the moments, or otherwise a function that directly describes a set of moments.
Solution
The first three moments of a distribution are the mean, variance and skewness. You can calculate those using the code below.
Precip = pd.read_csv("../Data/dailyPrecipitation.csv", parse_dates=True, index_col=0).dropna()
np.mean(Precip)
np.float64(2.202092921070281)
np.var(Precip)
np.float64(19.26110637105162)
stats.skew(Precip)
array([3.66369029])
Alternatively you could use the describe function from Scipy.
stats.describe(Precip)
DescribeResult(nobs=np.int64(43241), minmax=(array([0.]), array([63.9])), mean=array([2.20209292]), variance=array([19.26155182]), skewness=array([3.66369029]), kurtosis=array([20.57153234]))
You find that the precipitation violates the properties of the normal distribution and is highly skewed. Precipitation often follows a gamma distribution, which is bounded by zero and has a long tail. In Python you can fit any kind of distribution to your data and check if it makes sense. Below you find examples that fit both a normal and a gamma distribution.
mean, std = stats.norm.fit(Precip)
randomData = stats.norm.rvs(mean, std, size= len(Precip))
plt.hist(Precip, alpha=0.5)
plt.hist(randomData, alpha=0.5)
plt.show()
alpha, loc, beta = stats.gamma.fit(Precip)
randomData = stats.gamma.rvs(alpha, loc, beta, size= len(Precip))
plt.hist(Precip, alpha=0.5)
plt.hist(randomData, alpha=0.5)
plt.show()
You can see straight away that the normal distribution does not work for this example and that the gamma distribution is better suited. We will not explore this specific gamma distribution further, but you now know how to obtain the parameters for any distribution in Python. We will, however, further explore the annual maximum precipitation event.
Start by resampling the precipitation data to obtain the annual maximum value using the resample function. After that make a histogram of the data and describe what you see.
Solution
annualMaxPrecip = Precip.resample("YE").max()
### Or for older Python versions
#annualMaxPrecip = Precip.resample("Y").max()
annualMaxPrecip.plot(kind="hist")
Does the annual maximum precipitation reflect a normal distribution or not?
Solution
# Calculate the quantiles of the precipitation data
real_quantiles = annualMaxPrecip["Precip"].quantile(q=np.linspace(0.01, 0.99, 100))
# Calculate the theoretical quantiles of a normal distribution
norm_quantiles = ss.norm.ppf(np.linspace(0.01, 0.99, 100))
# Create the Q-Q plot
plt.scatter(norm_quantiles, real_quantiles)
plt.xlabel("Theoretical Quantiles (Normal Distribution)")
plt.ylabel("Real Quantiles")
plt.title("Q-Q Plot")
# Add a line of perfect agreement
plt.plot([norm_quantiles[0], norm_quantiles[-1]], [real_quantiles.iloc[0], real_quantiles.iloc[-1]], color='r')
plt.show()
stats.shapiro(annualMaxPrecip)
ShapiroResult(statistic=np.float64(0.8985685894648064), pvalue=np.float64(1.825672009377136e-07))
Now transform the data to a logarithmic scale and make a histogram plot. Does it now reflect a normal distribution?
Solution
annualMaxPrecip["logAnnualMaxPrecip"] = np.log10(annualMaxPrecip["Precip"])
annualMaxPrecip["logAnnualMaxPrecip"].plot(kind="hist")
Or set a logarithmic x-axis directly in the plotting function (note that the bins themselves remain linear):
annualMaxPrecip["Precip"].plot(kind="hist", logx=True)
If all is well you find that the log-transformed annual maximum precipitation does approach a normal distribution. The advantage of this is that we can now use all the tools and tests we have available for the normal distribution to test different hypotheses on the data.
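As a small illustration with synthetic data (the lognormal sample below is made up for this sketch and is not the precipitation record): the Shapiro-Wilk test rejects normality for the raw skewed values, while the log-transformed values behave much more like a normal sample.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)
# Synthetic right-skewed sample, similar in spirit to annual maxima
sample = rng.lognormal(mean=2.0, sigma=0.4, size=200)

# Shapiro-Wilk on the raw data: the skew leads to a very small p-value
p_raw = stats.shapiro(sample).pvalue

# Shapiro-Wilk on the log-transformed data: normality is usually not rejected
p_log = stats.shapiro(np.log10(sample)).pvalue

print(p_raw, p_log)
```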
Exploring binomial and Poisson distributions
We can also ask ourselves the question how often “x” happens, where “x” is a certain event or extreme. Let’s go back to the temperature data and the ice skating example. In the Netherlands people are always concerned about whether we can have an Elfstedentocht, or when we can go skating in general. Use the variable you created for Question 11, which should be a dataframe with True and False values from 1901 onwards. We have already concluded that there is a 6.5% chance of obtaining a day <= 0C. Let’s assume that this value is the true chance of having a day <= 0C. We can start drawing random samples from the full dataset.
## To ensure we get the same random samples and thus the same answers.
np.random.seed(10)
sampleSize = 100
sampleData = Tas.sample(sampleSize)
fraction = sampleData[sampleData <= 0].count()/sampleSize
print(fraction)
Tas    0.09
dtype: float64
You find that the fraction is higher than the expected value of 6.5% (0.065). This is because our sample size is too small to correctly reflect the observational record. Again this confirms that sample size is very relevant: getting enough data is key in the real world and in the Geosciences as well. We will explore how many samples we need using a for-loop.
## To ensure we get the same random samples and thus the same answers.
np.random.seed(10)
## Sample size
samplePoints = np.arange(5,40000,20)
## Empty array for the output
fraction = np.zeros(len(samplePoints))
## For-loop that loops over the samplePoints and at the same time over index numbers from 0 to the length of samplePoints
for i, s in enumerate(samplePoints):
    sampleData = Tas.sample(s)
    fraction[i] = sampleData[sampleData <= 0].count().iloc[0]/s
## Plotting the output with log x-axis
plt.plot(samplePoints, fraction, "-")
plt.xscale('log')
plt.show()
You will observe that you need a rather large sample size to obtain the “true” value of 6.5%, and that the estimate is still converging even with 40000 samples.
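The slow convergence can be quantified with the textbook standard error of a sample proportion, sqrt(p(1-p)/n) (this formula is an addition here, not part of the practical's code):

```python
import numpy as np

p = 0.065  # assumed "true" probability of a day <= 0C

# Standard error of the estimated fraction for increasing sample sizes
for n in [100, 1000, 40000]:
    se = np.sqrt(p * (1 - p) / n)
    print(n, round(se, 4))
```

Even at n = 40000 the estimate still wobbles by roughly a tenth of a percentage point, which matches the noise left in the plot.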
Do you think it will take longer or shorter to obtain the “true” exceedance of a value if you take a threshold with a higher probability, for example 10C? Test this with the script above and provide the figure.
Solution
## To ensure we get the same random samples and thus the same answers.
np.random.seed(10)
## Sample size
samplePoints = np.arange(5,40000,20)
## Empty array for the output
fraction = np.zeros(len(samplePoints))
## For-loop that loops over the samplePoints and at the same time over index numbers from 0 to the length of samplePoints
for i, s in enumerate(samplePoints):
    sampleData = Tas.sample(s)
    fraction[i] = sampleData[sampleData <= 10].count().iloc[0]/s
## Plotting the output with log x-axis
plt.plot(samplePoints, fraction, "-")
plt.xscale('log')
plt.show()
We can also ask ourselves the question how many days in the coming year will have a temperature <= 0C. For this we can use the Poisson distribution to generate likely numbers of freezing days per year. We know from the previous analysis that we have 3004 days <= 0C in a total record of 45097 days, which means that on average we have 3004/45097*365 = 24.31 days <= 0C per year since 1901. Now let’s plot a Poisson distribution.
np.random.seed(10)
avgDays = Tas[Tas <= 0].count().iloc[0]/len(Tas)*365
print(avgDays)
randomYears = stats.poisson.rvs(avgDays, size=1000)
plt.hist(randomYears)
24.313368960241256
(array([ 5., 26., 86., 194., 223., 278., 127., 45., 12., 4.]),
array([ 9. , 12.2, 15.4, 18.6, 21.8, 25. , 28.2, 31.4, 34.6, 37.8, 41. ]),
<BarContainer object of 10 artists>)

Or if you want to directly calculate the probabilities you can do the following:
x = np.arange(5,45)
y = stats.poisson.pmf(x, avgDays)
plt.plot(x, y, '-')
We can also use the Poisson distribution to ask questions like: what is the likelihood of having 15 or fewer freezing days? Use stats.poisson.cdf(k, mu) to find the chance of having 15 or fewer freezing days in a given year.
Solution
avgDays = Tas[Tas <= 0].count().iloc[0]/len(Tas)*365
stats.poisson.cdf(15, avgDays)
np.float64(0.03009273491856656)
If you have paid careful attention you might have noticed that in this case the Poisson distribution approaches a normal distribution. This means that the Central Limit Theorem comes into play again. Now lower the rate of 24.31 to something like 3 days per year. Does this still approach a normal distribution, and if not, why not?
Solution
np.random.seed(10)
avgDays = 3
print(avgDays)
randomYears = stats.poisson.rvs(avgDays, size=1000)
plt.hist(randomYears)
3
(array([ 59., 164., 229., 208., 158., 104., 49., 17., 9., 3.]),
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.]),
<BarContainer object of 10 artists>)

Or if you want to directly calculate the probabilities you can do the following:
x = np.arange(0,8)
y = stats.poisson.pmf(x, avgDays)
plt.plot(x, y, '-')
Using a t-distribution
When we work with problems that have a small sample size, for example field observations where sample sizes might range from a few measurements up to maybe a hundred, the true variance of the mean is unknown and must be approximated by the sample standard error of the mean. The ratio between the sample mean and the standard error has a distribution called the t-distribution, which converges towards the normal distribution for larger sample sizes.
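That convergence is easy to check numerically; this sketch (independent of the practical's data) compares the 97.5% quantile of the t-distribution with the normal quantile for increasing degrees of freedom:

```python
import scipy.stats as stats

# 97.5% quantile of the standard normal distribution
z = stats.norm.ppf(0.975)

# For few degrees of freedom the t-quantile is clearly larger (heavier tails);
# for many degrees of freedom it is almost identical to the normal quantile
for df in [5, 30, 1000]:
    print(df, round(stats.t.ppf(0.975, df), 3))
print("normal", round(z, 3))
```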
We can now use both the normal and t-distribution for the Temperature dataset with the fitting information from Question 12. We first fit both a t-distribution with a limited sample and a normal distribution and then plot the probability density function (PDF) for both distributions.
sampleTas = Tas.sample(15, random_state=1)
tStat, loc, scale = stats.t.fit(sampleTas)
degreeF = len(sampleTas) - 2
mean, std = stats.norm.fit(sampleTas)
xLocs = np.arange(Tas.min().iloc[0], Tas.max().iloc[0], 0.05)
plt.plot(xLocs, stats.t.pdf(xLocs, degreeF, loc, scale), "-", color = "Red", label="t-dist")
plt.plot(xLocs, stats.norm.pdf(xLocs, mean, std), "-", color = "Blue", label="normal-dist")
plt.legend()
plt.show()
And again for the full sample of data.
tStat, loc, scale = stats.t.fit(Tas)
degreeF = len(Tas) - 2
mean, std = stats.norm.fit(Tas)
xLocs = np.arange(Tas.min().iloc[0], Tas.max().iloc[0], 0.05)
plt.plot(xLocs, stats.t.pdf(xLocs, degreeF, loc, scale), "-", color = "Red", label="t-dist")
plt.plot(xLocs, stats.norm.pdf(xLocs, mean, std), "-", color = "Blue", label="normal-dist")
plt.legend()
plt.show()
Do you think there is a significant difference between the t-distribution and the normal distribution for small samples? And for the large sample?
Solution
For smaller samples we do see the heavier tails; however, if we provide more data points we see that the t-distribution slowly approaches the normal distribution, as the theory tells us.
Now we will run the same experiment on a smaller data sample. First we create the subset of the temperature data.
np.random.seed(10)
sampleSize = 100
sampleData = Tas.sample(sampleSize)
What happens if you use a smaller subset of the temperature data: does the t-distribution still approximate the normal distribution of the full dataset? Adjust the code above to show the Cumulative Density Function (CDF) of the sampleData and describe your interpretation of the CDF.
Solution
Tas.hist(cumulative=True, density=True, histtype='step', label="Full set")
array([[<Axes: title={'center': 'Tas'}>]], dtype=object)

sampleSize = 100
sampleData = Tas.sample(sampleSize)
sampleData.hist(cumulative=True, density=True, histtype='step', label="sample set")
array([[<Axes: title={'center': 'Tas'}>]], dtype=object)

**Or if you want to make it into one figure**
plt.hist(Tas, cumulative=True, density=True, histtype='step',label="Full set")
sampleSize = 100
sampleData = Tas.sample(sampleSize)
plt.hist(sampleData, cumulative=True, density=True, histtype='step',label="sample set")
plt.legend()
plt.show()
Do your own data exploration
Q = pd.read_csv("../Data/dailyDischarge.csv", parse_dates=True, index_col=0).dropna()
You have loaded the daily discharge data for the Rhine river, now start exploring. You should do the following things:
Some plots that allow you to understand what the data looks like
Describe some relevant statistical values
Describe if the data is normal or not and why
Answer the questions:
How often do we get floods above 8000 m3s-1?
How often do we go below 1000 m3s-1?
Calculate the observed exceedance for flood and drought per year
Tip: you can calculate the annual occurrence by comparing the values to your threshold
Then calculate the number of times you actually have an event
Flood: Q[Q>8000].resample("YE").count()
Drought: Q[Q<1000].resample("YE").count()
Then plot the histogram of both selections to look at the distribution (also needed for the next sub-question)
Create a Poisson distribution for both flood and drought
Can you hypothesise why the Poisson distribution doesn’t work for the drought example?
Solution
Some relevant plots: a timeseries and a histogram
Q.plot()
Q.plot(kind="hist")
Computing the moments of this distribution using Scipy
stats.describe(Q)
DescribeResult(nobs=np.int64(44771), minmax=(array([575.]), array([13000.])), mean=array([2200.07870753]), variance=array([1270596.22239363]), skewness=array([2.10692705]), kurtosis=array([6.9744782]))
Make a QQ plot and test with Shapiro-Wilk
# Calculate the quantiles of the discharge data
real_quantiles = Q["Q"].quantile(q=np.linspace(0.01, 0.99, 100))
# Calculate the theoretical quantiles of a normal distribution
norm_quantiles = ss.norm.ppf(np.linspace(0.01, 0.99, 100))
# Create the Q-Q plot
plt.scatter(norm_quantiles, real_quantiles)
plt.xlabel("Theoretical Quantiles (Normal Distribution)")
plt.ylabel("Real Quantiles")
plt.title("Q-Q Plot")
# Add a line of perfect agreement
plt.plot([norm_quantiles[0], norm_quantiles[-1]], [real_quantiles.iloc[0], real_quantiles.iloc[-1]], color='r')
plt.show()
stats.shapiro(Q)
UserWarning: scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 44771.
ShapiroResult(statistic=np.float64(0.8310783131039213), pvalue=np.float64(5.946077911910897e-108))
Floods above 8000 m3s-1 in the full record
flood = Q[Q >= 8000].count().iloc[0]
print(flood)
122
Droughts below 1000 m3s-1 in the full record
drought = Q[Q <= 1000].count().iloc[0]
print(drought)
2199
First we determine the average number of flood days per year and then plot the observed distribution
floodDays = flood/len(Q)*365
print(floodDays)
floodYears = Q[Q>8000].resample("YE").count()
plt.hist(floodYears)
0.994617051216189
(array([99., 2., 7., 7., 2., 1., 4., 0., 0., 1.]),
array([ 0. , 1.3, 2.6, 3.9, 5.2, 6.5, 7.8, 9.1, 10.4, 11.7, 13. ]),
<BarContainer object of 10 artists>)

Now the same for the droughts
droughtDays = drought/len(Q)*365
print(droughtDays)
droughtYears = Q[Q<1000].resample("YE").count()
plt.hist(droughtYears)
17.927564718232784
(array([92., 12., 8., 3., 2., 4., 0., 1., 0., 1.]),
array([ 0. , 20.2, 40.4, 60.6, 80.8, 101. , 121.2, 141.4, 161.6,
181.8, 202. ]),
<BarContainer object of 10 artists>)

Two methods again for the flood poisson distribution
randomYears = stats.poisson.rvs(floodDays, size=1000)
plt.hist(randomYears)
(array([365., 384., 0., 173., 0., 54., 17., 0., 6., 1.]),
array([0. , 0.6, 1.2, 1.8, 2.4, 3. , 3.6, 4.2, 4.8, 5.4, 6. ]),
<BarContainer object of 10 artists>)

x = np.arange(0,5)
y = stats.poisson.pmf(x, floodDays)
plt.plot(x, y, '-')
Two methods again for the drought poisson distribution
randomYears = stats.poisson.rvs(droughtDays, size=1000)
plt.hist(randomYears)
(array([ 7., 54., 149., 370., 224., 141., 47., 5., 2., 1.]),
array([ 5. , 8.3, 11.6, 14.9, 18.2, 21.5, 24.8, 28.1, 31.4, 34.7, 38. ]),
<BarContainer object of 10 artists>)

x = np.arange(0,40)
y = stats.poisson.pmf(x, droughtDays)
plt.plot(x, y, '-')
What you have learned today
If all is well you have learned today:
Simple plots to look at your data
How to see which distribution can be used to describe the data
Which measures you can use to describe a distribution
How to check exceedance of thresholds
How to work with small samples and what distributions to use in that case