7  Grouping, Indexing, Slicing, and Subsetting DataFrames

7.1 Recap: Load the data

import pandas as pd
surveys_df = pd.read_csv('../course_materials/data/surveys.csv')
surveys_df.describe()
record_id month day year plot_id hindfoot_length weight
count 35549.000000 35549.000000 35549.000000 35549.000000 35549.000000 31438.000000 32283.000000
mean 17775.000000 6.477847 15.991195 1990.475231 11.397001 29.287932 42.672428
std 10262.256696 3.396925 8.257366 7.493355 6.799406 9.564759 36.631259
min 1.000000 1.000000 1.000000 1977.000000 1.000000 2.000000 4.000000
25% 8888.000000 4.000000 9.000000 1984.000000 5.000000 21.000000 20.000000
50% 17775.000000 6.000000 16.000000 1990.000000 11.000000 32.000000 37.000000
75% 26662.000000 10.000000 23.000000 1997.000000 17.000000 36.000000 48.000000
max 35549.000000 12.000000 31.000000 2002.000000 24.000000 70.000000 280.000000

In addition to learning about characteristics of our dataset as a whole, we may be interested in analyzing parts (subsets) of our data. For example, we may want to know how heavy our samples are:

surveys_df['weight'].describe()
count    32283.000000
mean        42.672428
std         36.631259
min          4.000000
25%         20.000000
50%         37.000000
75%         48.000000
max        280.000000
Name: weight, dtype: float64

We can also extract one specific metric if we wish:

surveys_df['weight'].min()    # smallest weight
surveys_df['weight'].max()    # largest weight
surveys_df['weight'].mean()   # average weight
surveys_df['weight'].std()    # standard deviation of the weights
surveys_df['weight'].count()  # number of non-null weight values
32283

Note that a notebook cell only displays the result of its last expression, here the count.

7.2 Selecting data using column names

In the morning session we saw how to get specific values from a dictionary using its keys. We can do the same with DataFrames; in fact, we have already accessed the values of a column by its name. In this section we show two ways to select values, slices of data, and subsets of a DataFrame.

surveys_df[['species_id']]
species_id
0 NL
1 NL
2 DM
3 DM
4 DM
... ...
35544 AH
35545 AH
35546 RM
35547 DO
35548 NaN

35549 rows × 1 columns

In contrast to the examples we saw in the previous section, we are using double brackets [[...]] here instead of single brackets. Single brackets retrieve a Pandas Series of values and can only be applied to a single column. The double brackets [[...]] tell Pandas to format the values of one or more columns as a Pandas DataFrame. With this syntax we can create a DataFrame that consists of the two columns plot_id and species_id:

surveys_df[['plot_id', 'species_id']]
plot_id species_id
0 2 NL
1 3 NL
2 2 DM
3 7 DM
4 3 DM
... ... ...
35544 15 AH
35545 15 AH
35546 10 RM
35547 7 DO
35548 5 NaN

35549 rows × 2 columns
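The difference between single and double brackets becomes visible when we inspect the type of the result; a quick check:

print(type(surveys_df['species_id']))    # <class 'pandas.core.series.Series'>
print(type(surveys_df[['species_id']]))  # <class 'pandas.core.frame.DataFrame'>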

7.3 Slicing subsets of rows

Slicing using the [] operator selects a set of rows and/or columns from a DataFrame. To slice out a set of rows, you use the syntax data[start:stop]. When slicing in Pandas, the start bound is included in the output, but the stop bound is not: slicing stops just before the stop bound. So if you want to select rows 0, 1 and 2, your code looks like this:

surveys_df[0:3]
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN

We can select specific ranges of our data in both the row and column directions using the row and column labels or using the row and column indices.

Let’s have a look at iloc first, where we use the index of a row and/or column to select it. In the example below we select the first three entries and the columns month, day and year (the second, third and fourth columns; remember that indexing starts at 0 in Python). The first range of numbers selects the rows, the second the columns:

# iloc[row slicing, column slicing]
surveys_df.iloc[0:3, 1:4]
month day year
0 7 16 1977
1 7 16 1977
2 7 16 1977

We can achieve the same result in a different way:

In a first step we select the columns by their names surveys_df[['month', 'day', 'year']]. From the resulting DataFrame we then, in a second step, select the first three rows [0:3]. Putting the two steps together, the code looks like this:

surveys_df[['month', 'day', 'year']][0:3]
month day year
0 7 16 1977
1 7 16 1977
2 7 16 1977

What happens in this command?

# data.iloc[row, column]
surveys_df.iloc[2, 6]
'F'

Here iloc receives two single indices instead of slices, so it returns a single value: the entry in the third row (index 2) of the seventh column (index 6), the sex of that record.

7.3.1 Summary: Selecting slices, rows and columns

In the first method we extract the column by specifying its name. The second method is essentially identical to the first one, as the 6th element (index 5) of surveys_df.columns is species_id. The third method uses iloc to select all rows of the 6th column. There is also a .loc method that can be used for this, but to avoid further confusion we leave it out for now. If you are interested, read this more detailed explanation.

# By name
# --------------------------------------
# Method1
plot_species_1 = surveys_df['species_id']

# By location
# --------------------------------------
# Method2
plot_id_2 = surveys_df[surveys_df.columns[5]]

# Method3
plot_id_3 = surveys_df.iloc[:,5]
# --------------------------------------
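To convince ourselves that the three methods are equivalent, we can compare the resulting Series with .equals(), a small check that is not part of the exercise itself:

print(plot_species_1.equals(plot_id_2))  # True
print(plot_id_2.equals(plot_id_3))       # True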

Exercise 2 and 3

Now go to the Jupyter Dashboard in your internet browser and continue with the afternoon exercises 2 and 3.

7.3.2 Subsetting Data according to user-defined criteria

We can extract subsets of our DataFrame following the general syntax data_frame[<condition_on_data>], where <condition_on_data> is a conditional statement on the DataFrame content itself. You may think of the conditional statement as a question or query you ask your DataFrame. Here are some examples:

# What are the data collected in the year 2002?
surveys_df[surveys_df['year'] == 2002]
record_id month day year plot_id species_id sex hindfoot_length weight
33320 33321 1 12 2002 1 DM M 38.0 44.0
33321 33322 1 12 2002 1 DO M 37.0 58.0
33322 33323 1 12 2002 1 PB M 28.0 45.0
33323 33324 1 12 2002 1 AB NaN NaN NaN
33324 33325 1 12 2002 1 DO M 35.0 29.0
... ... ... ... ... ... ... ... ... ...
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

2229 rows × 9 columns

# What are the data NOT collected in the year 2002?
surveys_df[surveys_df['year'] != 2002]
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
... ... ... ... ... ... ... ... ... ...
33315 33316 12 16 2001 11 NaN NaN NaN NaN
33316 33317 12 16 2001 13 NaN NaN NaN NaN
33317 33318 12 16 2001 14 NaN NaN NaN NaN
33318 33319 12 16 2001 15 NaN NaN NaN NaN
33319 33320 12 16 2001 16 NaN NaN NaN NaN

33320 rows × 9 columns

# What are the data NOT collected in the year 2002? (different syntax)
surveys_df[~(surveys_df['year'] == 2002)]
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
... ... ... ... ... ... ... ... ... ...
33315 33316 12 16 2001 11 NaN NaN NaN NaN
33316 33317 12 16 2001 13 NaN NaN NaN NaN
33317 33318 12 16 2001 14 NaN NaN NaN NaN
33318 33319 12 16 2001 15 NaN NaN NaN NaN
33319 33320 12 16 2001 16 NaN NaN NaN NaN

33320 rows × 9 columns

Our filtering conditions can be very specific: they may target different columns of the DataFrame, and they can be combined using the logical operator &, which means and:

# What are the data collected between 2000 and 2002 on female species?
surveys_df[(surveys_df['year'] >= 2000) & (surveys_df['year'] <= 2002) &
           (surveys_df['sex'] == 'F')]
record_id month day year plot_id species_id sex hindfoot_length weight
30158 30159 1 8 2000 1 PP F 22.0 17.0
30160 30161 1 8 2000 1 PP F 21.0 17.0
30164 30165 1 8 2000 1 PP F 22.0 15.0
30168 30169 1 8 2000 2 PB F 25.0 24.0
30171 30172 1 8 2000 2 NL F 30.0 137.0
... ... ... ... ... ... ... ... ... ...
35539 35540 12 31 2002 15 PB F 26.0 23.0
35540 35541 12 31 2002 15 PB F 24.0 31.0
35541 35542 12 31 2002 15 PB F 26.0 29.0
35542 35543 12 31 2002 15 PB F 27.0 34.0
35546 35547 12 31 2002 10 RM F 15.0 14.0

2582 rows × 9 columns

Spotlight on Syntax: We can introduce line breaks in commands, as in the filter above.

The method isin() allows us to specify a set of “permitted” values for a column. In the example below we filter for data collected on female animals in January, March or April of the year 2000: “Give me all data where sex is female and the data was collected in 2000 in month 1, 3 or 4.”

surveys_df[(surveys_df['year'] == 2000) & (surveys_df['sex'] == 'F') & 
           (surveys_df['month'].isin([1,3,4]))]
record_id month day year plot_id species_id sex hindfoot_length weight
30158 30159 1 8 2000 1 PP F 22.0 17.0
30160 30161 1 8 2000 1 PP F 21.0 17.0
30164 30165 1 8 2000 1 PP F 22.0 15.0
30168 30169 1 8 2000 2 PB F 25.0 24.0
30171 30172 1 8 2000 2 NL F 30.0 137.0
... ... ... ... ... ... ... ... ... ...
30637 30638 4 30 2000 20 PP F 22.0 20.0
30640 30641 4 30 2000 20 NL F 30.0 NaN
30645 30646 4 30 2000 24 PP F 20.0 17.0
30647 30648 4 30 2000 17 DM F 36.0 46.0
30648 30649 4 30 2000 17 DO F 36.0 59.0

156 rows × 9 columns

There is also an operator for or. For the sake of showing the syntax, below we fetch all entries that were collected in the year 2000 or belong to female animals:

print(surveys_df[(surveys_df['year'] == 2000) | (surveys_df['sex'] == 'F')])
       record_id  month  day  year  plot_id species_id sex  hindfoot_length  \
2              3      7   16  1977        2         DM   F             37.0   
6              7      7   16  1977        2         PE   F              NaN   
8              9      7   16  1977        1         DM   F             34.0   
9             10      7   16  1977        6         PF   F             20.0   
10            11      7   16  1977        5         DS   F             53.0   
...          ...    ...  ...   ...      ...        ...  ..              ...   
35539      35540     12   31  2002       15         PB   F             26.0   
35540      35541     12   31  2002       15         PB   F             24.0   
35541      35542     12   31  2002       15         PB   F             26.0   
35542      35543     12   31  2002       15         PB   F             27.0   
35546      35547     12   31  2002       10         RM   F             15.0   

       weight  
2         NaN  
6         NaN  
8         NaN  
9         NaN  
10        NaN  
...       ...  
35539    23.0  
35540    31.0  
35541    29.0  
35542    34.0  
35546    14.0  

[16552 rows x 9 columns]

Summary of logical operators in Pandas

operator   Python   Pandas
and        and      &
or         or       |
not        not      ~
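A common pitfall: the plain Python keywords and, or and not do not work element-wise on a Series; Pandas raises an error about the truth value of a Series being ambiguous. A minimal illustration (the exact error message may vary between versions):

# Raises "ValueError: The truth value of a Series is ambiguous":
# surveys_df[(surveys_df['year'] == 2000) and (surveys_df['sex'] == 'F')]

# Correct: use the element-wise operator, with parentheses around each condition
surveys_df[(surveys_df['year'] == 2000) & (surveys_df['sex'] == 'F')]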

Exercise 4

Now go to the Jupyter Dashboard in your internet browser and continue with the afternoon exercise 4.

7.4 Optional: DataFrame Cleaning

A simple exploration of our DataFrame showed us that some columns contain many invalid values (NaN). One of the most important preliminary steps of data analysis is cleaning your data set, i.e. “getting rid” of invalid values: we want to make sure that our data only contains meaningful values.

Now that we have mastered selecting, slicing, and subsetting, we can easily clean our DataFrame with a few lines of code. Let us have a look at the function isnull. It is a Pandas function which we imported at the beginning with import pandas as pd, so we can call it like this:

pd.isnull(4)
False
pd.isnull([1, 2, 3, '', dict(), None])
array([False, False, False, False, False,  True])

We can pass single values or array-like values to the function. The function then checks whether each value is NaN (Not a Number) or None and returns a boolean array. Note that values like the empty string (a string without any characters in it) or an empty dictionary do not count as null values: they have a type and simply do not contain anything, but they are something. Null values in Python are only NaN and None. When you read tabular data into a DataFrame, empty cells are shown as NaN. None is of type NoneType, which we will not dive into further in this workshop.

With all that knowledge we can now detect the null values in the column weight and do something about them. Let us have a look at how many null values we can find:

pd.isnull(surveys_df['weight']) # boolean array indicating where null values are found
0         True
1         True
2         True
3         True
4         True
         ...  
35544     True
35545     True
35546    False
35547    False
35548     True
Name: weight, Length: 35549, dtype: bool

We can use this Series of boolean values as a mask on the DataFrame. Here we extract only the rows of surveys_df where the weight is not defined:

surveys_df[pd.isnull(surveys_df['weight'])] # all lines that have a null value in the column weight
len(surveys_df[pd.isnull(surveys_df['weight'])]) # length
3266

As you can see, 3266 weight values in our whole dataset are not usable. We need to do something with those values.
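Since True counts as 1 in a sum, we can obtain the same count more compactly by summing the boolean mask, an alternative to the len() approach above:

pd.isnull(surveys_df['weight']).sum()  # 3266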

Another thing that would not make sense is a negative weight. Let’s check whether the remaining 32283 values in the weight column are positive:

len(surveys_df[surveys_df['weight'] > 0])
32283

As we see, we have 32283 positive weight values. The remaining 3266 values in the weight column are not set, so they are null. How can we impute those values? Let us have a look at the average weight:

surveys_df['weight'].mean()
42.672428212991356

A smooth run, without errors or warnings. As we said several times, Pandas is a library designed for data analysis, and when performing data analysis it is very common to deal with non-numeric values. In particular, the .mean() method has an argument skipna that, when set to True (the default value, so we do not need to specify it), excludes NaN values. This means that Pandas simply ignores whatever is not numeric and performs the computation only on numeric values.
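To see skipna in action we can switch it off explicitly; with skipna=False a single NaN already makes the whole result NaN:

print(surveys_df['weight'].mean())              # 42.672428..., NaN values skipped
print(surveys_df['weight'].mean(skipna=False))  # nan, the NaN values poison the result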

If we are not happy with Pandas’ default behaviour, we can manually decide which value to assign to cells that contain null values. One possible choice is setting them to zero. To do that, we just apply the method .fillna(<value>), where <value> is the number we want to substitute for the null values (in our case, 0).

cleaned_weight1 = surveys_df['weight'].fillna(0)
cleaned_weight_ave1 = cleaned_weight1.mean()
print(cleaned_weight_ave1)
38.751976145601844
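As a quick sanity check: filling with zeros leaves the total weight unchanged while the number of values grows from 32283 to 35549, so the new mean is simply the old sum divided by the full row count:

surveys_df['weight'].sum() / len(surveys_df)  # .sum() skips NaN; gives 38.7519..., as above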

You see that when filling the null values with 0, the average weight decreases. This is because the mean is now computed over many more values, the added ones all being zero. Conscious of this problem, we may choose a more appropriate value to “fill” our null values. How about using the “clean” mean from our first computation?

cleaned_weight2 = surveys_df['weight'].fillna(surveys_df['weight'].mean())
cleaned_weight_ave2 = cleaned_weight2.mean()
print(cleaned_weight_ave2)
42.672428212991356

This time we obtain exactly the same result as our first computation. This is because we substituted the null values with a mean that was computed excluding the null values.
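Filling is not the only option: we can also drop the null values altogether with .dropna(). A minimal sketch:

dropped_weight = surveys_df['weight'].dropna()  # Series without the 3266 null values
print(len(dropped_weight))                      # 32283
print(dropped_weight.mean())                    # 42.672428..., identical to the skipna mean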

(Optional) Exercise 5

To deepen your knowledge you can continue with Exercise 5.

7.5 Optional: Grouping

We often want to calculate summary statistics grouped by subsets or attributes within fields of our data. For example, we might want to calculate the average weight of all individuals per site.

As we have seen above we can calculate basic statistics for all records in a single column using the syntax below:

surveys_df['weight'].describe()
count    32283.000000
mean        42.672428
std         36.631259
min          4.000000
25%         20.000000
50%         37.000000
75%         48.000000
max        280.000000
Name: weight, dtype: float64

If we want to summarize by one or more variables, for example sex, we can use Pandas’ .groupby() method. Once we have created a GroupBy object, we can quickly calculate summary statistics by a group of our choice.

grouped_data = surveys_df.groupby('sex')
grouped_data.describe()
record_id month ... hindfoot_length weight
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
sex
F 15690.0 18036.412046 10423.089000 3.0 8917.50 18075.5 27250.00 35547.0 15690.0 6.587253 ... 36.0 64.0 15303.0 42.170555 36.847958 4.0 20.0 34.0 46.0 274.0
M 17348.0 17754.835601 10132.203323 1.0 8969.75 17727.5 26454.25 35548.0 17348.0 6.396184 ... 36.0 58.0 16879.0 42.995379 36.184981 4.0 20.0 39.0 49.0 280.0

2 rows × 56 columns

The output is a bit overwhelming. Let’s just have a look at one statistical value, the mean, to understand what is happening here:

grouped_data.mean(numeric_only=True)  # restrict the computation to the numeric columns
record_id month day year plot_id hindfoot_length weight
sex
F 18036.412046 6.587253 15.880943 1990.644997 11.440854 28.836780 42.170555
M 17754.835601 6.396184 16.078799 1990.480401 11.098282 29.709578 42.995379

We see that the data is divided into two groups: one where the value in the column sex equals “F” and one where it equals “M”. The statistics are then calculated per group for each column of the DataFrame. Note that rows where sex is NaN are left out entirely, and NaN values in the other columns are skipped as well.
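Indeed, the counts of the two groups (15690 females and 17348 males) do not add up to the 35549 rows of the DataFrame. The difference consists of rows without a recorded sex, which we can count directly:

surveys_df['sex'].isnull().sum()  # 2511 rows with sex NaN are not part of any group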

7.6 Optional: Structure of a groupby object

We can investigate which rows are assigned to which group as follows:

print(type(grouped_data.groups)) # dictionary
print("Sexes: ", grouped_data.groups.keys()) # keys are the unique values of the column we grouped by
print("Rows belonging to sex 'F': ", grouped_data.groups['F']) # values are row indexes 
<class 'pandas.io.formats.printing.PrettyDict'>
Sexes:  dict_keys(['F', 'M'])
Rows belonging to sex 'F':  Int64Index([    2,     6,     8,     9,    10,    14,    15,    16,    19,
               20,
            ...
            35531, 35532, 35535, 35536, 35537, 35539, 35540, 35541, 35542,
            35546],
           dtype='int64', length=15690)
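We can also pull all rows of a single group out as a regular DataFrame with .get_group():

females = grouped_data.get_group('F')  # DataFrame containing only rows where sex == 'F'
print(len(females))                    # 15690, matching the length of the index above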

7.7 Optional: Grouping by multiple columns

Now let’s have a look at a more complex grouping example. We want overview statistics of the weight of all females and males per plot. So in fact we want to group by sex and by plot_id at the same time.

This will give us exactly 48 groups for our survey data:

  • female, plot id = 1
  • female, plot id = 2
  • …
  • female, plot id = 24
  • male, plot id = 1
  • …
  • male, plot id = 24

Why 48 groups? We have 24 unique values for plot_id. Per plot we have two groups of samples, female and male. Hence, the grouping returns 48 groups.
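We can verify this count directly: .nunique() returns the number of distinct values in a column and .ngroups the number of resulting groups, so the following is a quick sanity check:

print(surveys_df['plot_id'].nunique())                 # 24 distinct plots
print(surveys_df.groupby(['plot_id', 'sex']).ngroups)  # 24 plots x 2 sexes = 48 groups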

grouped_data = surveys_df.groupby(['plot_id', 'sex'])
grouped_data['weight'].describe()
count mean std min 25% 50% 75% max
plot_id sex
1 F 826.0 46.311138 33.240958 5.0 26.00 40.0 50.00 196.0
M 1072.0 55.950560 41.035686 4.0 37.00 46.0 54.00 231.0
2 F 954.0 52.561845 45.547697 5.0 25.00 40.0 51.00 274.0
M 1114.0 51.391382 46.690887 5.0 24.00 42.0 50.00 278.0
3 F 873.0 31.215349 30.687451 4.0 15.00 23.0 34.00 199.0
M 827.0 34.163241 40.260426 5.0 13.00 23.0 39.00 250.0
4 F 850.0 46.818824 33.560664 5.0 28.00 40.0 47.00 200.0
M 1010.0 48.888119 32.254168 4.0 32.00 44.5 50.00 187.0
5 F 516.0 40.974806 36.396966 5.0 21.00 35.0 45.00 248.0
M 573.0 40.708551 31.250967 6.0 21.00 40.0 49.00 240.0
6 F 721.0 36.352288 29.513333 5.0 19.00 29.0 41.00 188.0
M 739.0 36.867388 30.867779 6.0 18.00 31.0 46.00 241.0
7 F 326.0 20.006135 17.895937 6.0 12.00 17.0 23.00 170.0
M 303.0 21.194719 23.971252 4.0 11.00 17.0 23.00 235.0
8 F 817.0 45.623011 31.045426 5.0 25.00 42.0 50.00 178.0
M 962.0 49.641372 34.820355 5.0 29.00 45.0 52.00 173.0
9 F 823.0 53.618469 35.572793 6.0 35.00 43.0 54.00 177.0
M 984.0 49.519309 31.888023 6.0 37.00 46.0 50.00 275.0
10 F 138.0 17.094203 14.074820 7.0 10.00 13.0 20.00 130.0
M 139.0 19.971223 25.061068 4.0 10.00 12.0 22.00 237.0
11 F 796.0 43.515075 29.627049 5.0 27.00 40.0 46.00 208.0
M 994.0 43.366197 28.425105 6.0 25.00 43.0 49.00 212.0
12 F 1040.0 49.831731 43.790247 6.0 26.00 41.0 48.25 264.0
M 1174.0 48.909710 39.301038 7.0 25.25 43.0 50.00 280.0
13 F 610.0 40.524590 36.109806 5.0 21.00 31.0 42.00 192.0
M 757.0 40.097754 31.753448 6.0 20.00 34.0 47.00 241.0
14 F 692.0 47.355491 29.563455 5.0 37.00 43.0 48.00 211.0
M 1029.0 45.159378 25.272173 5.0 35.00 44.0 50.00 222.0
15 F 467.0 26.670236 31.983137 4.0 12.50 18.0 26.00 198.0
M 401.0 27.523691 38.631271 4.0 10.00 18.0 25.00 259.0
16 F 211.0 25.810427 20.902314 4.0 13.00 21.0 31.00 158.0
M 265.0 23.811321 14.663726 5.0 11.00 20.0 35.00 61.0
17 F 874.0 48.176201 37.485528 6.0 27.00 41.0 49.00 192.0
M 1011.0 47.558853 34.082010 4.0 27.00 45.0 51.00 216.0
18 F 740.0 36.963514 35.184417 5.0 17.00 28.5 40.00 212.0
M 607.0 43.546952 41.864279 7.0 18.00 33.0 48.00 256.0
19 F 514.0 21.978599 14.008822 6.0 12.00 20.0 29.00 139.0
M 567.0 20.306878 12.553954 4.0 10.00 19.0 25.00 100.0
20 F 631.0 52.624406 55.257665 5.0 17.00 30.0 48.00 220.0
M 588.0 44.197279 43.361503 5.0 17.00 34.0 47.00 223.0
21 F 596.0 25.974832 22.619863 4.0 11.00 24.0 31.00 188.0
M 431.0 22.772622 18.984554 4.0 9.00 19.0 32.00 190.0
22 F 646.0 53.647059 38.588538 5.0 29.00 39.0 54.00 161.0
M 648.0 54.572531 38.841066 6.0 31.00 44.0 53.00 212.0
23 F 163.0 20.564417 18.933945 8.0 12.00 16.0 23.00 199.0
M 205.0 18.941463 17.979740 4.0 10.00 12.0 22.00 131.0
24 F 479.0 47.914405 49.112574 6.0 21.00 33.0 44.00 251.0
M 479.0 39.321503 42.003947 4.0 17.00 24.0 45.00 230.0

With the same grouped data we can compare the average weight of males and females per plot and visualise it as a bar plot:

grouped_data = surveys_df.groupby(['plot_id', 'sex'])
grouped_data['weight'].mean().plot(kind='bar')

For each (plot_id, sex) pair there is one bar in the plot. We get a more insightful plot when we unstack the values for sex into separate columns, so that only the plot ids appear on the x-axis:

grouped_data['weight'].mean().unstack().plot(kind = 'bar')
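To see why this helps, we can look at what .unstack() does to the grouped means: it pivots the inner index level, sex, into columns, so that each plot id becomes a single row with one value for F and one for M:

print(grouped_data['weight'].mean().unstack().head())  # columns 'F' and 'M', one row per plot_id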

7.8 Optional: Summary grouping

Grouping is one of the most common operations in data analysis. Data often consists of different measurements on the same samples. In many cases we are not only interested in one particular measurement but in the combination of several measurements. In the picture below we labelled samples with green lines, blue dots and red lines. We are now interested in how these three groups relate to each other given all the other measurements in the DataFrame. Pandas’ groupby function gives us the means to compare such groups with several built-in statistical methods.

[Figure: Grouping sketch]

Optional Exercise 6 to 8

To deepen your knowledge you can do Exercises 6 to 8.

After you finished the exercises please come back to this document and continue with the following chapter.