7 Grouping, Indexing, Slicing, and Subsetting DataFrames
7.1 Recap: Load the data
import pandas as pd

surveys_df = pd.read_csv('../course_materials/data/surveys.csv')
surveys_df.describe()
          record_id         month           day          year       plot_id  hindfoot_length        weight
count  35549.000000  35549.000000  35549.000000  35549.000000  35549.000000     31438.000000  32283.000000
mean   17775.000000      6.477847     15.991195   1990.475231     11.397001        29.287932     42.672428
std    10262.256696      3.396925      8.257366      7.493355      6.799406         9.564759     36.631259
min        1.000000      1.000000      1.000000   1977.000000      1.000000         2.000000      4.000000
25%     8888.000000      4.000000      9.000000   1984.000000      5.000000        21.000000     20.000000
50%    17775.000000      6.000000     16.000000   1990.000000     11.000000        32.000000     37.000000
75%    26662.000000     10.000000     23.000000   1997.000000     17.000000        36.000000     48.000000
max    35549.000000     12.000000     31.000000   2002.000000     24.000000        70.000000    280.000000
In addition to learning about characteristics of our dataset as a whole, we may be interested in analyzing parts (subsets) of our data. For example, we might want to know how heavy our samples are:
surveys_df['weight'].describe()
count 32283.000000
mean 42.672428
std 36.631259
min 4.000000
25% 20.000000
50% 37.000000
75% 48.000000
max 280.000000
Name: weight, dtype: float64
We can also extract one specific metric if we wish:
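For instance (a minimal example; any other summary statistic could be pulled out the same way):

surveys_df['weight'].min()   # 4.0, the minimum value we already saw in describe()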
7.2 Selecting values from a DataFrame
In the morning session we saw how to get specific values from dictionaries using keys. We can do the same with DataFrames; in fact, we have already accessed the values in a column by the column name. In this section we will show you two ways to select values, slices of data, and subsets of a DataFrame.
surveys_df[['species_id']]
      species_id
0             NL
1             NL
2             DM
3             DM
4             DM
...          ...
35544         AH
35545         AH
35546         RM
35547         DO
35548        NaN

35549 rows × 1 columns
In contrast to the examples we saw in the previous section, we are using double brackets [[...]] instead of single brackets. Single brackets retrieve a Pandas Series of values and can only be applied to a single column. The double brackets [[...]] tell Pandas to format the values of one or more columns as a Pandas DataFrame. With this syntax we can create a DataFrame that consists of the two columns plot_id and species_id!
surveys_df[['plot_id', 'species_id']]
       plot_id species_id
0            2         NL
1            3         NL
2            2         DM
3            7         DM
4            3         DM
...        ...        ...
35544       15         AH
35545       15         AH
35546       10         RM
35547        7         DO
35548        5        NaN

35549 rows × 2 columns
7.3 Slicing subsets of rows
Slicing using the [] operator selects a set of rows and/or columns from a DataFrame. To slice out a set of rows, you use the following syntax: data[start:stop]. When slicing in Pandas, the start bound is included in the output but the stop bound is not: the slice stops before the stop bound. So if you want to select rows 0, 1 and 2, your code would look like this:
surveys_df[0:3]
   record_id  month  day  year  plot_id species_id sex  hindfoot_length  weight
0          1      7   16  1977        2         NL   M             32.0     NaN
1          2      7   16  1977        3         NL   M             33.0     NaN
2          3      7   16  1977        2         DM   F             37.0     NaN
We can select specific ranges of our data in both the row and column directions using the row and column labels or using the row and column indices.
Let’s have a look at iloc first, where we use the index of a row and/or column to select it. In the example below we select the first three entries and the columns month, day and year (the second, third and fourth columns; remember that indexing starts at 0 in Python). The first range of numbers selects the rows, the second the columns:
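The corresponding code cell is not shown in this text; a minimal sketch of the selection just described would be:

# data.iloc[rows, columns]: rows 0-2 and columns 1-3 (month, day, year); the stop bounds are excluded
surveys_df.iloc[0:3, 1:4]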
We can achieve the same result in a different way: in a first step we select the columns by their names, surveys_df[['month', 'day', 'year']]. From the resulting DataFrame we then, in a second step, select the first three rows with [0:3]. Putting the two steps together, the code looks like this:
surveys_df[['month', 'day', 'year']][0:3]
   month  day  year
0      7   16  1977
1      7   16  1977
2      7   16  1977
What happens in this command?
# data.iloc[row, column]
surveys_df.iloc[2, 6]
'F'
7.3.1 Summary: Selecting slices, rows and columns
In the first method we extract the column by specifying its name. The second method is essentially identical to the first one, as the 6th element (index 5) of the Series surveys_df.columns is species_id. The third method uses iloc to select all the rows of the 6th column. Additionally, there is also a .loc method that can be used for this, but to avoid further confusion we leave it out for now. If you are interested, read this more detailed explanation.
# By name
# --------------------------------------
# Method 1
species_id_1 = surveys_df['species_id']

# By location
# --------------------------------------
# Method 2
species_id_2 = surveys_df[surveys_df.columns[5]]

# Method 3
species_id_3 = surveys_df.iloc[:, 5]
# --------------------------------------
Exercise 2 and 3
Now go to the Jupyter Dashboard in your internet browser and continue with the afternoon exercises 2 and 3.
7.3.2 Subsetting Data according to user-defined criteria
We can extract subsets of our DataFrame following the general syntax data_frame[<condition_on_data>], where <condition_on_data> is a conditional statement on the DataFrame content itself. You may think of the conditional statement as a question or query you ask your DataFrame. Here are some examples:
# What are the data collected in the year 2002?
surveys_df[surveys_df['year'] == 2002]
       record_id  month  day  year  plot_id species_id  sex  hindfoot_length  weight
33320      33321      1   12  2002        1         DM    M             38.0    44.0
33321      33322      1   12  2002        1         DO    M             37.0    58.0
33322      33323      1   12  2002        1         PB    M             28.0    45.0
33323      33324      1   12  2002        1         AB  NaN              NaN     NaN
33324      33325      1   12  2002        1         DO    M             35.0    29.0
...          ...    ...  ...   ...      ...        ...  ...              ...     ...
35544      35545     12   31  2002       15         AH  NaN              NaN     NaN
35545      35546     12   31  2002       15         AH  NaN              NaN     NaN
35546      35547     12   31  2002       10         RM    F             15.0    14.0
35547      35548     12   31  2002        7         DO    M             36.0    51.0
35548      35549     12   31  2002        5        NaN  NaN              NaN     NaN

2229 rows × 9 columns
# What are the data NOT collected in the year 2002?
surveys_df[surveys_df['year'] != 2002]
       record_id  month  day  year  plot_id species_id  sex  hindfoot_length  weight
0              1      7   16  1977        2         NL    M             32.0     NaN
1              2      7   16  1977        3         NL    M             33.0     NaN
2              3      7   16  1977        2         DM    F             37.0     NaN
3              4      7   16  1977        7         DM    M             36.0     NaN
4              5      7   16  1977        3         DM    M             35.0     NaN
...          ...    ...  ...   ...      ...        ...  ...              ...     ...
33315      33316     12   16  2001       11        NaN  NaN              NaN     NaN
33316      33317     12   16  2001       13        NaN  NaN              NaN     NaN
33317      33318     12   16  2001       14        NaN  NaN              NaN     NaN
33318      33319     12   16  2001       15        NaN  NaN              NaN     NaN
33319      33320     12   16  2001       16        NaN  NaN              NaN     NaN

33320 rows × 9 columns
# What are the data NOT collected in the year 2002? (different syntax)
surveys_df[~(surveys_df['year'] == 2002)]
       record_id  month  day  year  plot_id species_id  sex  hindfoot_length  weight
0              1      7   16  1977        2         NL    M             32.0     NaN
1              2      7   16  1977        3         NL    M             33.0     NaN
2              3      7   16  1977        2         DM    F             37.0     NaN
3              4      7   16  1977        7         DM    M             36.0     NaN
4              5      7   16  1977        3         DM    M             35.0     NaN
...          ...    ...  ...   ...      ...        ...  ...              ...     ...
33315      33316     12   16  2001       11        NaN  NaN              NaN     NaN
33316      33317     12   16  2001       13        NaN  NaN              NaN     NaN
33317      33318     12   16  2001       14        NaN  NaN              NaN     NaN
33318      33319     12   16  2001       15        NaN  NaN              NaN     NaN
33319      33320     12   16  2001       16        NaN  NaN              NaN     NaN

33320 rows × 9 columns
Our filtering conditions may be very specific: they can target different columns in the DataFrame, and they can be combined using the logical operator &, which means and:
# What are the data collected between 2000 and 2002 on female species?
surveys_df[(surveys_df['year'] >= 2000) & (surveys_df['year'] <= 2002) & (surveys_df['sex'] == 'F')]
       record_id  month  day  year  plot_id species_id sex  hindfoot_length  weight
30158      30159      1    8  2000        1         PP   F             22.0    17.0
30160      30161      1    8  2000        1         PP   F             21.0    17.0
30164      30165      1    8  2000        1         PP   F             22.0    15.0
30168      30169      1    8  2000        2         PB   F             25.0    24.0
30171      30172      1    8  2000        2         NL   F             30.0   137.0
...          ...    ...  ...   ...      ...        ...  ..              ...     ...
35539      35540     12   31  2002       15         PB   F             26.0    23.0
35540      35541     12   31  2002       15         PB   F             24.0    31.0
35541      35542     12   31  2002       15         PB   F             26.0    29.0
35542      35543     12   31  2002       15         PB   F             27.0    34.0
35546      35547     12   31  2002       10         RM   F             15.0    14.0

2582 rows × 9 columns
Spotlight on Syntax: We can introduce line breaks in commands.
Below we filter for rows with data collected on females in the year 2000 or 2002: “Give me all data where sex is female and the data was collected in 2000 or 2002”.
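The original code cell is not reproduced in this text; a sketch of such a command, split over several lines and using the .isin() method introduced below, might look like this:

# sketch of the filter described above: females, collected in 2000 or 2002
surveys_df[
    (surveys_df['sex'] == 'F') &
    (surveys_df['year'].isin([2000, 2002]))
]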
The method isin() allows us to specify a set of “permitted” values for a certain column. The output of another such query is shown below:
record_id month day year plot_id species_id sex hindfoot_length \
2 3 7 16 1977 2 DM F 37.0
6 7 7 16 1977 2 PE F NaN
8 9 7 16 1977 1 DM F 34.0
9 10 7 16 1977 6 PF F 20.0
10 11 7 16 1977 5 DS F 53.0
... ... ... ... ... ... ... .. ...
35539 35540 12 31 2002 15 PB F 26.0
35540 35541 12 31 2002 15 PB F 24.0
35541 35542 12 31 2002 15 PB F 26.0
35542 35543 12 31 2002 15 PB F 27.0
35546 35547 12 31 2002 10 RM F 15.0
weight
2 NaN
6 NaN
8 NaN
9 NaN
10 NaN
... ...
35539 23.0
35540 31.0
35541 29.0
35542 34.0
35546 14.0
[16552 rows x 9 columns]
Summary on logical operators in Pandas

operator   Python   Pandas
and        and      &
or         or       |
not        not      ~
Exercise 4
Now go to the Jupyter Dashboard in your internet browser and continue with the afternoon exercise 4.
7.4 Optional: DataFrame Cleaning
A simple exploration of our DataFrame showed us that some columns contain many invalid values (NaN). One of the most important preliminary operations of data analysis is cleaning your data set, i.e. “getting rid” of non-numerical or non-character values. We want to make sure that our data only contains meaningful values.
Now that we have mastered selecting, slicing, and subsetting, we can easily clean our DataFrame with a few lines of code. Let us have a look at the function isnull. It is a Pandas function, available through the import pandas as pd statement at the beginning. We can call the function like this:
pd.isnull(4)
False
pd.isnull([1, 2, 3, '', dict(), None])
array([False, False, False, False, False, True])
We can pass single values or array-like values to the function. The function will then check whether each value is NaN (Not a Number) or None and return a boolean array. Note that values like the empty string (a string without any characters in it) or an empty dictionary do not count as null values: they do have a type and simply contain no data, but they are something. Null values in Python are only NaN and None. When you read tabular data into a DataFrame, empty cells will be shown as NaN. None is of type NoneType, which we will not dive into further in this workshop.
With all that knowledge we can now detect the null values in the column weight and do something about them. Let us have a look at how many null values we can find:
pd.isnull(surveys_df['weight']) # boolean array indicating where null values are found
We can use the Series with the boolean values as a mask on the DataFrame. Here we only extract the rows of surveys_df where the weight is not defined:
surveys_df[pd.isnull(surveys_df['weight'])]       # all rows that have a null value in the column weight
len(surveys_df[pd.isnull(surveys_df['weight'])])  # how many such rows there are
3266
As you can see, 3266 weight values in our whole dataset are not usable. We need to do something with those values.
Another thing that would not make sense is negative weights. Let’s check whether the remaining 32283 values in the weight column are positive:
len(surveys_df[surveys_df['weight'] > 0])
32283
As we see, we have 32283 positive weight values. The remaining 3266 values in the weight column are not set, so they are null. How can we impute these values? Let us have a look at the average weight first:
surveys_df['weight'].mean()
42.672428212991356
A smooth run, without errors or warnings. As we said several times, Pandas is a library designed for data analysis, and when performing data analysis it is very common to deal with non-numeric values. In particular, the .mean() method has an argument called skipna which, when set to True (the default value, so we do not need to specify it), excludes NaN values. This means that, in this case, Pandas simply ignores whatever is not numeric and performs the computation only on numeric values.
If we are not happy with Pandas’ default behaviour, we can manually decide which value to assign to cells that contain null values. One possible choice is setting them to zero. To do that, we just need to apply the method .fillna(<value>), where <value> is the number we want to substitute for the null values (in our case, 0).
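A minimal sketch of this step (the exact cell is not reproduced in this text):

# fill the missing weights with 0, then recompute the mean
surveys_df['weight'].fillna(0).mean()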
You see that when filling the null values with 0, the average weight decreases. This is because the mean is now computed on data containing many more zeros than before. Conscious of this problem, we may now choose a more appropriate value to “fill” our null values. How about we use the “clean” mean from our first computation?
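Again as a sketch (the variable name clean_mean is ours):

clean_mean = surveys_df['weight'].mean()        # mean computed while skipping NaN values
surveys_df['weight'].fillna(clean_mean).mean()  # fill the NaNs with that mean, then recompute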
This time we obtain exactly the same result as in our first computation; this is because we substituted the null values with the mean computed excluding the null values.
(Optional) Exercise 5
To deepen your knowledge you can continue with Exercise 5.
7.5 Optional: Grouping
We often want to calculate summary statistics grouped by subsets or attributes within fields of our data. For example, we might want to calculate the average weight of all individuals per site.
As we have seen above we can calculate basic statistics for all records in a single column using the syntax below:
surveys_df['weight'].describe()
count 32283.000000
mean 42.672428
std 36.631259
min 4.000000
25% 20.000000
50% 37.000000
75% 48.000000
max 280.000000
Name: weight, dtype: float64
If we want to summarize by one or more variables, for example sex, we can use Pandas’ .groupby() method. Once we’ve created a groupby DataFrame, we can quickly calculate summary statistics by a group of our choice.
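The cell that creates the grouped object does not appear in this text; assuming we group by the sex column (as the output further below suggests), it would look roughly like this:

# group the rows by the values in the 'sex' column
grouped_data = surveys_df.groupby('sex')
# summary statistics for every numeric column, computed per group
grouped_data.describe()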
The output is a bit overwhelming. Let’s just have a look at one statistical value, the mean, to understand what is happening here:
grouped_data.mean()
/tmp/ipykernel_2430/1133710423.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
grouped_data.mean()
        record_id     month        day         year    plot_id  hindfoot_length     weight
sex
F    18036.412046  6.587253  15.880943  1990.644997  11.440854        28.836780  42.170555
M    17754.835601  6.396184  16.078799  1990.480401  11.098282        29.709578  42.995379
We see that the data is divided into two groups: one group where the value in the column sex equals “F” and another group where it equals “M”. The statistics are then calculated over all samples in each group, for each of the columns in the DataFrame. Note that samples whose sex is NaN are left out, and NaN values within the columns are excluded from the calculations.
7.6 Optional: Structure of a groupby object
We can investigate which rows are assigned to which group as follows:
print(type(grouped_data.groups))                                # dictionary
print("Sexes: ", grouped_data.groups.keys())                    # keys are the unique values of the column we grouped by
print("Rows belonging to sex 'F': ", grouped_data.groups['F'])  # values are row indexes
Now let’s have a look at a more complex grouping example. We want overview statistics of the weight of all females and males by plot id, so in fact we want to group by sex and by plot_id at the same time.
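A sketch of such a grouping (the variable name grouped_data2 is ours):

# group by two columns at once, then summarise the weight per group
grouped_data2 = surveys_df.groupby(['plot_id', 'sex'])
grouped_data2['weight'].describe()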
This will give us exactly 48 groups for our survey data:
female, plot id = 1
female, plot id = 2
…
female, plot id = 24
male, plot id = 1
…
male, plot id = 24
Why 48 groups? We have 24 unique values for plot_id. Per plot we have two groups of samples, female and male. Hence, the grouping returns 48 groups.
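One quick way to visualise these groups is a bar plot of, for example, the mean weight per group; as a sketch (not the original notebook cell):

# one bar per (plot_id, sex) combination
surveys_df.groupby(['plot_id', 'sex'])['weight'].mean().plot(kind='bar')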
For each (plot_id, sex) key there is a bar in such a plot. We can get a more insightful plot when we unstack the values for each key and only put the plot ids on the x-axis:
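A sketch of the unstacked version:

# unstack() turns the sex level of the group index into columns,
# so the x-axis shows only the plot ids and each plot gets one bar per sex
surveys_df.groupby(['plot_id', 'sex'])['weight'].mean().unstack().plot(kind='bar')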
Grouping is one of the most common operations in data analysis. Data often consist of different measurements on the same samples. In many cases we are not only interested in one particular measurement but in the cross product of measurements. In the picture below we labeled samples with green lines, blue dots and red lines. We are now interested in how these three different groups relate to each other given all the other measurements in the DataFrame. Pandas’ groupby function gives us the means to compare these three groups with several built-in statistical methods.
Optional Exercises 6 to 8
To deepen your knowledge you can do Exercises 6 to 8.
After you finished the exercises please come back to this document and continue with the following chapter.