import pandas as pd
surveys_df = pd.read_csv('../../course_materials/data/surveys.csv') # in your notebook the path should be 'data/surveys.csv'
11 Afternoon Exercises: Working with data
11.0.1 Exercise 0
Type the following commands and check the outputs. Can you tell what each command does? What is the difference between commands with and without parentheses?
surveys_df.shape # Answer: the dimensions of the dataframe
surveys_df.columns # Answer: the column names of the dataframe
surveys_df.index # Answer: the index (row labels) of the dataframe
surveys_df.dtypes # Answer: the data types of each column
surveys_df.head(<try_various_integers_here>) # Answer: the first n rows of the dataframe
surveys_df.tail(<try_various_integers_here>) # Answer: the last n rows of the dataframe
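The question about parentheses can be illustrated with a minimal sketch. It assumes a tiny hypothetical table (`toy_df`, not part of the course data): attributes such as `shape` are stored information and are read without parentheses, while methods such as `head` perform an action and must be called.

```python
import pandas as pd

# Hypothetical three-row table, used only for illustration
toy_df = pd.DataFrame({"plot_id": [2, 3, 2], "weight": [40.0, None, 48.0]})

n_rows, n_cols = toy_df.shape   # attribute: no parentheses
first_two = toy_df.head(2)      # method: called with parentheses and an argument
head_method = toy_df.head       # without parentheses we only get the method object itself

print(n_rows, n_cols)
print(len(first_two))
print(callable(head_method))
```

Forgetting the parentheses on a method does not raise an error; it simply returns the method object instead of the result.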
11.0.2 Exercise 1
Perform some basic statistics on the weight column. For practical reasons, it can be useful to first create a variable weight that contains just the weight column. It will make the code look a bit cleaner. Can you tell what each method listed below does? Look at our exploratory plot: do the statistics make sense?
weight = surveys_df['weight'] # Answer: creates a new variable that contains the weight column
weight.min() # Answer: the minimum value of the weight column
weight.max() # Answer: the maximum value of the weight column
weight.mean() # Answer: the mean value of the weight column
weight.std() # Answer: the standard deviation of the weight column
weight.count() # Answer: the number of non-NaN values in the weight column
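As a small sketch of how these statistics treat missing values, assume a hypothetical Series `toy_weight` with one NaN. By default pandas skips NaN in `mean()`, and `count()` reports only the non-NaN entries.

```python
import pandas as pd

# Hypothetical weights, one of them missing
toy_weight = pd.Series([10.0, 20.0, None, 30.0])

n_total = len(toy_weight)                       # counts every entry, including NaN
n_valid = toy_weight.count()                    # counts only non-NaN entries
mean_default = toy_weight.mean()                # NaN is skipped by default
mean_with_nan = toy_weight.mean(skipna=False)   # NaN propagates when skipna=False

print(n_total, n_valid, mean_default, mean_with_nan)
```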
11.0.3 Exercise 2
- Swap the order of column names in surveys_df[['plot_id', 'species_id']].
- Repeat one of the column names, as in surveys_df[['plot_id', 'plot_id', 'species_id']]. What do the results look like and why?
  Answer: the repeated column appears twice in the result, with the same data in both copies. Column names do not have to be unique.
- Which error occurs in surveys_df['plot_id', 'species_id'] and why?
  Answer: KeyError: ('plot_id', 'species_id'). The column names are not in a list, so pandas looks for a single column labelled with that tuple. We need double square brackets to select multiple columns.
- Which error occurs in surveys_df['speciess']?
  Answer: KeyError: 'speciess'. The column name does not exist because of a typo.
print(surveys_df[['species_id', 'plot_id']])
species_id plot_id
0 NL 2
1 NL 3
2 DM 2
3 DM 7
4 DM 3
... ... ...
35544 AH 15
35545 AH 15
35546 RM 10
35547 DO 7
35548 NaN 5
[35549 rows x 2 columns]
surveys_df[['plot_id', 'plot_id', 'species_id']]
| | plot_id | plot_id | species_id |
---|---|---|---|
0 | 2 | 2 | NL |
1 | 3 | 3 | NL |
2 | 2 | 2 | DM |
3 | 7 | 7 | DM |
4 | 3 | 3 | DM |
... | ... | ... | ... |
35544 | 15 | 15 | AH |
35545 | 15 | 15 | AH |
35546 | 10 | 10 | RM |
35547 | 7 | 7 | DO |
35548 | 5 | 5 | NaN |
35549 rows × 3 columns
surveys_df['plot_id', 'species_id']
KeyError: ('plot_id', 'species_id')
surveys_df['speciess']
KeyError: 'speciess'
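The difference between single and double square brackets can be sketched on a hypothetical two-column table (`toy_df` is an illustration, not the survey data): one label returns a Series, a list of labels returns a DataFrame, and two labels without the inner list are treated as one tuple-shaped label.

```python
import pandas as pd

# Hypothetical table with the two columns used in this exercise
toy_df = pd.DataFrame({"plot_id": [2, 3], "species_id": ["NL", "DM"]})

one_col = toy_df["plot_id"]                    # single label -> Series
two_cols = toy_df[["species_id", "plot_id"]]   # list of labels -> DataFrame in the requested order

try:
    toy_df["plot_id", "species_id"]            # looked up as the single label ('plot_id', 'species_id')
    error_name = None
except KeyError as err:
    error_name = type(err).__name__

print(type(one_col).__name__, list(two_cols.columns), error_name)
```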
11.0.4 Exercise 3
What happens when you call:
- surveys_df[0:1]
  Answer: shows the first row of the DataFrame.
- surveys_df[:4]
  Answer: shows the first 4 rows of the DataFrame, from index 0 to index 3.
- surveys_df[:-1]
  Answer: shows all rows of the DataFrame except the last row.
surveys_df[0:1]
surveys_df[:4]
surveys_df[:-1]
| | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
35543 | 35544 | 12 | 31 | 2002 | 15 | US | NaN | NaN | NaN |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 rows × 9 columns
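The slicing behaviour can be checked on a small example. The sketch below assumes a hypothetical five-row table `toy_df`; it also shows that a bare integer inside the brackets is not a row slice but a column lookup.

```python
import pandas as pd

# Hypothetical five-row table
toy_df = pd.DataFrame({"record_id": [1, 2, 3, 4, 5]})

first_row = toy_df[0:1]       # rows at positions 0 up to (not including) 1
first_four = toy_df[:4]       # rows at positions 0 to 3
all_but_last = toy_df[:-1]    # everything except the final row

try:
    toy_df[0]                 # a bare integer is treated as a column label
    lookup_error = None
except KeyError:
    lookup_error = "KeyError"

print(len(first_row), len(first_four), len(all_but_last), lookup_error)
```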
11.0.5 Exercise 4
- Find all entries in the column sex which contain neither an M nor an F.
- Create a new DataFrame that contains only observations that are of sex male or female and where weight values are greater than 0.
df = surveys_df[(surveys_df['sex'] != 'M') & (surveys_df['sex'] != 'F')]
print("Number of rows not female or male:", len(df))
Number of rows not female or male: 2511
# Each comparison needs its own parentheses, because `&` binds more tightly than `>`.
df = surveys_df[((surveys_df['sex'] == 'M') | (surveys_df['sex'] == 'F')) & (surveys_df['weight'] > 0)]
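Combined conditions are easy to get subtly wrong because `&` and `|` bind more tightly than comparison operators such as `>` and `==`. One way to keep the expression readable is to name each mask first. This is a sketch on a hypothetical four-row table (`toy_df`), using `isin` as a shorthand for the two equality tests.

```python
import pandas as pd

# Hypothetical rows covering the interesting cases
toy_df = pd.DataFrame({
    "sex": ["M", "F", None, "M"],
    "weight": [40.0, 0.0, 12.0, None],
})

is_m_or_f = toy_df["sex"].isin(["M", "F"])   # equivalent to (sex == 'M') | (sex == 'F')
has_weight = toy_df["weight"] > 0            # comparisons with NaN evaluate to False

selected = toy_df[is_m_or_f & has_weight]
print(selected)
```

Only the first row passes both tests: the second has a weight of 0, the third has no known sex, and the fourth has a missing weight.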
11.0.6 Exercise 5: Putting it all together
- Clean the column sex (leave out samples of which we do not know whether they are male or female) and save the result as a new dataframe clean_df.
- Fill undefined weight values with the mean of all valid weights in surveys_df.
- Calculate the average weight of that new DataFrame clean_df.
# Step 1
# sex is 'F' or 'M'. The `|` means or.
clean_df = surveys_df[(surveys_df['sex']=='F') | (surveys_df['sex']=='M')]
# Alternative solution: select rows where sex is 'not' null. The `~` means not.
clean_df = surveys_df[~(surveys_df['sex'].isnull())]
# Step 2
# Note: fillna returns a new Series; clean_df itself stays unchanged unless the result is assigned back.
clean_df.weight.fillna(surveys_df['weight'].mean())
# Step 3
print("Average weight of surveys_df:", surveys_df['weight'].mean())
print("Average weight of clean_df:", clean_df['weight'].mean())
Average weight of surveys_df: 42.672428212991356
Average weight of clean_df: 42.60316325896464
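One detail worth checking in step 2 is that `fillna` returns a new object rather than modifying the data in place. The sketch below, on a hypothetical three-row table (`toy_df`), shows the difference between calling `fillna` and assigning its result back to the column; the `.copy()` is an assumption added here so that the assignment acts on an independent DataFrame.

```python
import pandas as pd

# Hypothetical data: one unknown sex, one missing weight
toy_df = pd.DataFrame({"sex": ["M", "F", None], "weight": [10.0, None, 30.0]})
mean_all = toy_df["weight"].mean()                       # mean of all valid weights

clean_toy = toy_df[toy_df["sex"].notna()].copy()         # independent copy of the cleaned rows

clean_toy["weight"].fillna(mean_all)                     # result is discarded
n_missing_before = int(clean_toy["weight"].isna().sum())

clean_toy["weight"] = clean_toy["weight"].fillna(mean_all)   # result is stored
n_missing_after = int(clean_toy["weight"].isna().sum())

print(mean_all, n_missing_before, n_missing_after, clean_toy["weight"].mean())
```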
11.0.7 Exercise 6
Let’s see in which plots animals get more food. Calculate the average weight per plot! Complete the code below.
grouped_data = surveys_df.groupby("plot_id")
grouped_data['weight'].mean()
plot_id
1 51.822911
2 52.251688
3 32.654386
4 47.928189
5 40.947802
6 36.738893
7 20.663009
8 47.758001
9 51.432358
10 18.541219
11 43.451757
12 49.496169
13 40.445660
14 46.277199
15 27.042578
16 24.585417
17 47.889593
18 40.005922
19 21.105166
20 48.665303
21 24.627794
22 54.146379
23 19.634146
24 43.679167
Name: weight, dtype: float64
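To read a result like this more easily, the per-plot means can be sorted. A minimal sketch on a hypothetical table (`toy_df`) with two plots:

```python
import pandas as pd

# Hypothetical data: two plots, one missing weight
toy_df = pd.DataFrame({
    "plot_id": [1, 1, 2, 2],
    "weight": [40.0, 50.0, 10.0, None],
})

mean_per_plot = toy_df.groupby("plot_id")["weight"].mean()   # NaN weights are skipped
heaviest_first = mean_per_plot.sort_values(ascending=False)

print(heaviest_first)
```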
11.0.8 Exercise 7
See below a more complex grouping example. Investigate the group keys and row indexes for this grouping.
Why are there more than 48 groups? Answer: rows whose sex is NaN are not ignored when the group keys are built, so each plot also gets a (nan, plot_id) key, giving 3 × 24 = 72 groups.
Calculate the average weight per group. What happened to the third group and why does it not turn up in our statistics? Answer: the third group is the one with NaN as its sex key. By default, groupby leaves out groups whose key is NaN when it aggregates, so they are not included in the statistics.
grouped_data = surveys_df.groupby(['sex', 'plot_id'])
print(len(grouped_data.groups))
grouped_data.groups.keys()
72
dict_keys([('F', 1), ('F', 2), ('F', 3), ('F', 4), ('F', 5), ('F', 6), ('F', 7), ('F', 8), ('F', 9), ('F', 10), ('F', 11), ('F', 12), ('F', 13), ('F', 14), ('F', 15), ('F', 16), ('F', 17), ('F', 18), ('F', 19), ('F', 20), ('F', 21), ('F', 22), ('F', 23), ('F', 24), ('M', 1), ('M', 2), ('M', 3), ('M', 4), ('M', 5), ('M', 6), ('M', 7), ('M', 8), ('M', 9), ('M', 10), ('M', 11), ('M', 12), ('M', 13), ('M', 14), ('M', 15), ('M', 16), ('M', 17), ('M', 18), ('M', 19), ('M', 20), ('M', 21), ('M', 22), ('M', 23), ('M', 24), (nan, 1), (nan, 2), (nan, 3), (nan, 4), (nan, 5), (nan, 6), (nan, 7), (nan, 8), (nan, 9), (nan, 10), (nan, 11), (nan, 12), (nan, 13), (nan, 14), (nan, 15), (nan, 16), (nan, 17), (nan, 18), (nan, 19), (nan, 20), (nan, 21), (nan, 22), (nan, 23), (nan, 24)])
grouped_data['weight'].mean()
sex plot_id
F 1 46.311138
2 52.561845
3 31.215349
4 46.818824
5 40.974806
6 36.352288
7 20.006135
8 45.623011
9 53.618469
10 17.094203
11 43.515075
12 49.831731
13 40.524590
14 47.355491
15 26.670236
16 25.810427
17 48.176201
18 36.963514
19 21.978599
20 52.624406
21 25.974832
22 53.647059
23 20.564417
24 47.914405
M 1 55.950560
2 51.391382
3 34.163241
4 48.888119
5 40.708551
6 36.867388
7 21.194719
8 49.641372
9 49.519309
10 19.971223
11 43.366197
12 48.909710
13 40.097754
14 45.159378
15 27.523691
16 23.811321
17 47.558853
18 43.546952
19 20.306878
20 44.197279
21 22.772622
22 54.572531
23 18.941463
24 39.321503
Name: weight, dtype: float64
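If the rows with an unknown sex should appear in the statistics, `groupby` accepts a `dropna` argument. The sketch below assumes pandas 1.1 or newer (where `dropna` is available) and uses a hypothetical table (`toy_df`) grouped on a single key to keep it short.

```python
import pandas as pd

# Hypothetical data with two rows of unknown sex
toy_df = pd.DataFrame({
    "sex": ["F", "M", None, None],
    "weight": [40.0, 50.0, 12.0, None],
})

default_means = toy_df.groupby("sex")["weight"].mean()                # NaN key is dropped
all_means = toy_df.groupby("sex", dropna=False)["weight"].mean()      # NaN key is kept

print(len(default_means), len(all_means))
```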
11.0.9 Exercise 8
Would it make sense to group our data frame by the column weight? Why or why not?
# In real life nearly every sample has a unique weight, so nearly every sample would
# be placed in its own group.
# In our training data there are quite a few distinct values for weight (see the counts below). So
# usually it is not a good idea to categorise (group) data on such values.
print("Number of rows:", len(surveys_df))
print(len(surveys_df['weight'].unique())) #includes nan
print(len(surveys_df.groupby(['weight']).groups)) #does not include nan
Number of rows: 35549
256
255
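When grouping on a continuous measurement is still desired, a common alternative is to bin the values first. The sketch below uses `pd.cut` on a hypothetical Series; the bin edges and the labels are arbitrary assumptions for illustration, not values from the course.

```python
import pandas as pd

# Hypothetical weights, one of them missing
toy_weight = pd.Series([5.0, 12.0, 48.0, 51.0, None], name="weight")

n_unique_with_nan = len(toy_weight.unique())   # unique() keeps NaN as a value
n_unique = toy_weight.nunique()                # nunique() ignores NaN by default

# Assumed bins: (0, 20], (20, 50], (50, 100]
weight_class = pd.cut(toy_weight, bins=[0, 20, 50, 100],
                      labels=["light", "medium", "heavy"])
counts = weight_class.value_counts()

print(n_unique_with_nan, n_unique)
print(counts)
```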
11.0.10 Exercise 9
In the given example of vertical concatenation, you concatenated two DataFrames with the same columns. What would happen if the two DataFrames to concatenate have a different number of columns and different column names?
- Create a new DataFrame using the last 10 rows of the species DataFrame (species_df);
- Concatenate vertically surveys_df_sub_first10 and your just created DataFrame;
- Print the concatenated DataFrame info on the screen. How many rows does it have? What happened to the columns? Explain why you get this result.
species_df = pd.read_csv("../../course_materials/data/species.csv")
species_df_sub_last10 = species_df.tail(10)
surveys_df_sub_first10 = surveys_df.head(10)
vert_concat = pd.concat([surveys_df_sub_first10, species_df_sub_last10], axis=0)
vert_concat
| | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 7.0 | 16.0 | 1977.0 | 2.0 | NL | M | 32.0 | NaN | NaN | NaN | NaN |
1 | 2.0 | 7.0 | 16.0 | 1977.0 | 3.0 | NL | M | 33.0 | NaN | NaN | NaN | NaN |
2 | 3.0 | 7.0 | 16.0 | 1977.0 | 2.0 | DM | F | 37.0 | NaN | NaN | NaN | NaN |
3 | 4.0 | 7.0 | 16.0 | 1977.0 | 7.0 | DM | M | 36.0 | NaN | NaN | NaN | NaN |
4 | 5.0 | 7.0 | 16.0 | 1977.0 | 3.0 | DM | M | 35.0 | NaN | NaN | NaN | NaN |
5 | 6.0 | 7.0 | 16.0 | 1977.0 | 1.0 | PF | M | 14.0 | NaN | NaN | NaN | NaN |
6 | 7.0 | 7.0 | 16.0 | 1977.0 | 2.0 | PE | F | NaN | NaN | NaN | NaN | NaN |
7 | 8.0 | 7.0 | 16.0 | 1977.0 | 1.0 | DM | M | 37.0 | NaN | NaN | NaN | NaN |
8 | 9.0 | 7.0 | 16.0 | 1977.0 | 1.0 | DM | F | 34.0 | NaN | NaN | NaN | NaN |
9 | 10.0 | 7.0 | 16.0 | 1977.0 | 6.0 | PF | F | 20.0 | NaN | NaN | NaN | NaN |
44 | NaN | NaN | NaN | NaN | NaN | SS | NaN | NaN | NaN | Spermophilus | spilosoma | Rodent |
45 | NaN | NaN | NaN | NaN | NaN | ST | NaN | NaN | NaN | Spermophilus | tereticaudus | Rodent |
46 | NaN | NaN | NaN | NaN | NaN | SU | NaN | NaN | NaN | Sceloporus | undulatus | Reptile |
47 | NaN | NaN | NaN | NaN | NaN | SX | NaN | NaN | NaN | Sigmodon | sp. | Rodent |
48 | NaN | NaN | NaN | NaN | NaN | UL | NaN | NaN | NaN | Lizard | sp. | Reptile |
49 | NaN | NaN | NaN | NaN | NaN | UP | NaN | NaN | NaN | Pipilo | sp. | Bird |
50 | NaN | NaN | NaN | NaN | NaN | UR | NaN | NaN | NaN | Rodent | sp. | Rodent |
51 | NaN | NaN | NaN | NaN | NaN | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
52 | NaN | NaN | NaN | NaN | NaN | ZL | NaN | NaN | NaN | Zonotrichia | leucophrys | Bird |
53 | NaN | NaN | NaN | NaN | NaN | ZM | NaN | NaN | NaN | Zenaida | macroura | Bird |
We get a total of 20 rows and 12 columns. The original DataFrames together had a total of 13 columns. As they both have a column species_id, this one is collapsed. All other columns are padded with NaN values. We expect 20 rows, as we are putting two DataFrames of 10 rows one after the other. The padding of the columns happens because these two DataFrames do not have the same column names. To keep all the information that was in the original DataFrames, the padding of columns that occur in only one of the two is necessary.
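The same effect can be reproduced on a much smaller scale. The sketch below assumes two hypothetical two-row tables (`toy_surveys` and `toy_species`) that share only the species_id column.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
toy_surveys = pd.DataFrame({"record_id": [1, 2], "species_id": ["NL", "DM"]})
toy_species = pd.DataFrame({"species_id": ["AB", "AH"],
                            "genus": ["Amphispiza", "Ammospermophilus"]})

stacked = pd.concat([toy_surveys, toy_species], axis=0)

print(stacked.shape)                          # rows add up, shared column appears once
print(stacked["record_id"].isna().sum())      # rows from toy_species have no record_id
print(stacked["record_id"].dtype)             # integers become floats to make room for NaN
```

The padding with NaN also explains why integer columns such as record_id are shown as 1.0, 2.0 and so on after concatenation.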
11.0.11 Exercise 10
- Looking at the inner_join example, can you explain how much of each of the two DataFrames is missing from the result?
- Now consider the other types of joins. For each one, can you predict the number of rows and the contents of the resulting DataFrame, based on the diagrams in the picture?
  - For the outer join;
  - For the left join;
  - For the right join.
- From the left DataFrame, three rows are not included in the inner_join DataFrame. This is because they have a value in their species_id column that is not present in the right DataFrame. From the right DataFrame, the information of 18 rows is missing from the result. This is because their species_id column has a value that does not occur in the left DataFrame. Note that the information from the two rows that are represented in the result is duplicated a number of times, as their species_id value occurs multiple times in the left DataFrame.
- The outer join has a total of 28 rows. You may notice that the first seven of those rows are the same as the result of the inner join, followed by the three rows from the left DataFrame that are not represented in the inner join, and finally, the 18 rows from the right DataFrame that are not represented in the inner join. This makes for a total of 7 + 3 + 18 = 28 rows. The outer join preserves all the information from both the left and right DataFrames.
# 2.
left_df = surveys_df.head(10)
right_df = species_df.head(20)
outer_join = pd.merge(left_df, right_df, left_on='species_id', right_on='species_id',
                      how='outer')
outer_join
| | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 7.0 | 16.0 | 1977.0 | 2.0 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent |
1 | 2.0 | 7.0 | 16.0 | 1977.0 | 3.0 | NL | M | 33.0 | NaN | Neotoma | albigula | Rodent |
2 | 3.0 | 7.0 | 16.0 | 1977.0 | 2.0 | DM | F | 37.0 | NaN | Dipodomys | merriami | Rodent |
3 | 4.0 | 7.0 | 16.0 | 1977.0 | 7.0 | DM | M | 36.0 | NaN | Dipodomys | merriami | Rodent |
4 | 5.0 | 7.0 | 16.0 | 1977.0 | 3.0 | DM | M | 35.0 | NaN | Dipodomys | merriami | Rodent |
5 | 8.0 | 7.0 | 16.0 | 1977.0 | 1.0 | DM | M | 37.0 | NaN | Dipodomys | merriami | Rodent |
6 | 9.0 | 7.0 | 16.0 | 1977.0 | 1.0 | DM | F | 34.0 | NaN | Dipodomys | merriami | Rodent |
7 | 6.0 | 7.0 | 16.0 | 1977.0 | 1.0 | PF | M | 14.0 | NaN | NaN | NaN | NaN |
8 | 10.0 | 7.0 | 16.0 | 1977.0 | 6.0 | PF | F | 20.0 | NaN | NaN | NaN | NaN |
9 | 7.0 | 7.0 | 16.0 | 1977.0 | 2.0 | PE | F | NaN | NaN | NaN | NaN | NaN |
10 | NaN | NaN | NaN | NaN | NaN | AB | NaN | NaN | NaN | Amphispiza | bilineata | Bird |
11 | NaN | NaN | NaN | NaN | NaN | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
12 | NaN | NaN | NaN | NaN | NaN | AS | NaN | NaN | NaN | Ammodramus | savannarum | Bird |
13 | NaN | NaN | NaN | NaN | NaN | BA | NaN | NaN | NaN | Baiomys | taylori | Rodent |
14 | NaN | NaN | NaN | NaN | NaN | CB | NaN | NaN | NaN | Campylorhynchus | brunneicapillus | Bird |
15 | NaN | NaN | NaN | NaN | NaN | CM | NaN | NaN | NaN | Calamospiza | melanocorys | Bird |
16 | NaN | NaN | NaN | NaN | NaN | CQ | NaN | NaN | NaN | Callipepla | squamata | Bird |
17 | NaN | NaN | NaN | NaN | NaN | CS | NaN | NaN | NaN | Crotalus | scutalatus | Reptile |
18 | NaN | NaN | NaN | NaN | NaN | CT | NaN | NaN | NaN | Cnemidophorus | tigris | Reptile |
19 | NaN | NaN | NaN | NaN | NaN | CU | NaN | NaN | NaN | Cnemidophorus | uniparens | Reptile |
20 | NaN | NaN | NaN | NaN | NaN | CV | NaN | NaN | NaN | Crotalus | viridis | Reptile |
21 | NaN | NaN | NaN | NaN | NaN | DO | NaN | NaN | NaN | Dipodomys | ordii | Rodent |
22 | NaN | NaN | NaN | NaN | NaN | DS | NaN | NaN | NaN | Dipodomys | spectabilis | Rodent |
23 | NaN | NaN | NaN | NaN | NaN | DX | NaN | NaN | NaN | Dipodomys | sp. | Rodent |
24 | NaN | NaN | NaN | NaN | NaN | EO | NaN | NaN | NaN | Eumeces | obsoletus | Reptile |
25 | NaN | NaN | NaN | NaN | NaN | GS | NaN | NaN | NaN | Gambelia | silus | Reptile |
26 | NaN | NaN | NaN | NaN | NaN | NX | NaN | NaN | NaN | Neotoma | sp. | Rodent |
27 | NaN | NaN | NaN | NaN | NaN | OL | NaN | NaN | NaN | Onychomys | leucogaster | Rodent |
- Ten rows. The resulting DataFrame closely resembles the original left DataFrame, but with information from the right DataFrame added to it, where applicable.
# 3.
left_join = pd.merge(left_df, right_df, left_on='species_id', right_on='species_id',
                     how='left')
left_join
| | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN | Neotoma | albigula | Rodent |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN | Dipodomys | merriami | Rodent |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN | Dipodomys | merriami | Rodent |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN | Dipodomys | merriami | Rodent |
5 | 6 | 7 | 16 | 1977 | 1 | PF | M | 14.0 | NaN | NaN | NaN | NaN |
6 | 7 | 7 | 16 | 1977 | 2 | PE | F | NaN | NaN | NaN | NaN | NaN |
7 | 8 | 7 | 16 | 1977 | 1 | DM | M | 37.0 | NaN | Dipodomys | merriami | Rodent |
8 | 9 | 7 | 16 | 1977 | 1 | DM | F | 34.0 | NaN | Dipodomys | merriami | Rodent |
9 | 10 | 7 | 16 | 1977 | 6 | PF | F | 20.0 | NaN | NaN | NaN | NaN |
- 25 rows. The resulting DataFrame closely resembles the original right DataFrame, but with information from the left DataFrame added to it, where applicable. Note that rows from the right DataFrame that have multiple matching rows in the left DataFrame are duplicated.
# 4.
right_join = pd.merge(left_df, right_df, left_on='species_id', right_on='species_id',
                      how='right')
right_join
| | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | NaN | AB | NaN | NaN | NaN | Amphispiza | bilineata | Bird |
1 | NaN | NaN | NaN | NaN | NaN | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
2 | NaN | NaN | NaN | NaN | NaN | AS | NaN | NaN | NaN | Ammodramus | savannarum | Bird |
3 | NaN | NaN | NaN | NaN | NaN | BA | NaN | NaN | NaN | Baiomys | taylori | Rodent |
4 | NaN | NaN | NaN | NaN | NaN | CB | NaN | NaN | NaN | Campylorhynchus | brunneicapillus | Bird |
5 | NaN | NaN | NaN | NaN | NaN | CM | NaN | NaN | NaN | Calamospiza | melanocorys | Bird |
6 | NaN | NaN | NaN | NaN | NaN | CQ | NaN | NaN | NaN | Callipepla | squamata | Bird |
7 | NaN | NaN | NaN | NaN | NaN | CS | NaN | NaN | NaN | Crotalus | scutalatus | Reptile |
8 | NaN | NaN | NaN | NaN | NaN | CT | NaN | NaN | NaN | Cnemidophorus | tigris | Reptile |
9 | NaN | NaN | NaN | NaN | NaN | CU | NaN | NaN | NaN | Cnemidophorus | uniparens | Reptile |
10 | NaN | NaN | NaN | NaN | NaN | CV | NaN | NaN | NaN | Crotalus | viridis | Reptile |
11 | 3.0 | 7.0 | 16.0 | 1977.0 | 2.0 | DM | F | 37.0 | NaN | Dipodomys | merriami | Rodent |
12 | 4.0 | 7.0 | 16.0 | 1977.0 | 7.0 | DM | M | 36.0 | NaN | Dipodomys | merriami | Rodent |
13 | 5.0 | 7.0 | 16.0 | 1977.0 | 3.0 | DM | M | 35.0 | NaN | Dipodomys | merriami | Rodent |
14 | 8.0 | 7.0 | 16.0 | 1977.0 | 1.0 | DM | M | 37.0 | NaN | Dipodomys | merriami | Rodent |
15 | 9.0 | 7.0 | 16.0 | 1977.0 | 1.0 | DM | F | 34.0 | NaN | Dipodomys | merriami | Rodent |
16 | NaN | NaN | NaN | NaN | NaN | DO | NaN | NaN | NaN | Dipodomys | ordii | Rodent |
17 | NaN | NaN | NaN | NaN | NaN | DS | NaN | NaN | NaN | Dipodomys | spectabilis | Rodent |
18 | NaN | NaN | NaN | NaN | NaN | DX | NaN | NaN | NaN | Dipodomys | sp. | Rodent |
19 | NaN | NaN | NaN | NaN | NaN | EO | NaN | NaN | NaN | Eumeces | obsoletus | Reptile |
20 | NaN | NaN | NaN | NaN | NaN | GS | NaN | NaN | NaN | Gambelia | silus | Reptile |
21 | 1.0 | 7.0 | 16.0 | 1977.0 | 2.0 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent |
22 | 2.0 | 7.0 | 16.0 | 1977.0 | 3.0 | NL | M | 33.0 | NaN | Neotoma | albigula | Rodent |
23 | NaN | NaN | NaN | NaN | NaN | NX | NaN | NaN | NaN | Neotoma | sp. | Rodent |
24 | NaN | NaN | NaN | NaN | NaN | OL | NaN | NaN | NaN | Onychomys | leucogaster | Rodent |
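The row counts of the different joins can be verified rather than counted by hand. The sketch below uses two hypothetical miniature tables (`toy_left`, `toy_right`) and the `indicator=True` option of `pd.merge`, which adds a `_merge` column stating where each row came from.

```python
import pandas as pd

# Hypothetical miniature tables: NL matches twice, PF and AB have no partner
toy_left = pd.DataFrame({"record_id": [1, 2, 3], "species_id": ["NL", "NL", "PF"]})
toy_right = pd.DataFrame({"species_id": ["NL", "AB"], "genus": ["Neotoma", "Amphispiza"]})

outer = pd.merge(toy_left, toy_right, on="species_id", how="outer", indicator=True)
origin_counts = outer["_merge"].value_counts()

n_inner = len(pd.merge(toy_left, toy_right, on="species_id", how="inner"))
n_left = len(pd.merge(toy_left, toy_right, on="species_id", how="left"))
n_right = len(pd.merge(toy_left, toy_right, on="species_id", how="right"))

print(origin_counts)
print(n_inner, n_left, n_right, len(outer))
```

Here the outer join has as many rows as the inner join plus the unmatched rows from each side, which mirrors the 7 + 3 + 18 = 28 reasoning above.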
11.0.12 Exercise 11
Time to play with plots! Create a multiplot following these instructions:
- Using the matplotlib.pyplot function subplots(), create a single figure (10x10 inches) with four subplots organized in two rows and two columns;
- In the top row, plot hindfoot_length vs. weight for female and male in two different plots with two different colors;
- In the bottom row, plot the same data as in the top row, but using data collected before (left plot) and after (right plot) 1990;
- Give each plot an appropriate descriptive title and customize the plot labels.
Feel free to use the DataFrame plot method or the plt.scatter function to plot data points, but be aware that, in any case, the first thing to do is to create the Figure and Axes.
EXTRA: The four plots have the same x and y axes spanning the same range. Can you remove the space between the four plots? Try it!
from matplotlib import pyplot as plt
fig, axes = plt.subplots(2,2,figsize=(10,10)) # prepare a matplotlib figure
# Top left plot, male data
surveys_df[surveys_df['sex']=='M'].plot("hindfoot_length", "weight", kind="scatter", ax=axes[0][0], color='blue')
axes[0][0].set_title('Male data')
axes[0][0].grid()
# Top right plot, female data
surveys_df[surveys_df['sex']=='F'].plot("hindfoot_length", "weight", kind="scatter", ax=axes[0][1], color='red')
axes[0][1].set_title('Female data')
axes[0][1].grid()
year = 1990 # cut-off year from the exercise statement
# Bottom left plot, male data before the cut-off year
surveys_df[(surveys_df['sex']=='M') & (surveys_df['year'] < year)].plot("hindfoot_length", "weight", kind="scatter", ax=axes[1][0], color='blue')
axes[1][0].set_title(f'Male data (< {year})')
axes[1][0].grid()
# Bottom right plot, female data from the cut-off year onwards
surveys_df[(surveys_df['sex']=='F') & (surveys_df['year'] >= year)].plot("hindfoot_length", "weight", kind="scatter", ax=axes[1][1], color='red')
axes[1][1].set_title(f'Female data (>= {year})')
axes[1][1].grid()
# Removing individual plot labels
for i in range(2):
    for j in range(2):
        axes[i][j].set_xlabel('')
        axes[i][j].set_ylabel('')
# Initializing figure labels (in this dataset hindfoot length is recorded in mm and weight in g)
fig.supxlabel("Hindfoot Length [mm]",fontsize=14)
fig.supylabel("Weight [g]",fontsize=14)
fig.suptitle('Scatter plot of weight versus hindfoot length', fontsize=15)
Text(0.5, 0.98, 'Scatter plot of weight versus hindfoot length')
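For the EXTRA question, one possible approach (a sketch, not necessarily the intended solution) is to share the axes between the subplots and set the spacing to zero with `subplots_adjust`. The points plotted here are hypothetical placeholders, and the `Agg` backend line is only an assumption so the sketch runs without a display; leave it out in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line in a notebook
from matplotlib import pyplot as plt

# Shared axes guarantee that all four panels span the same range
fig, axes = plt.subplots(2, 2, figsize=(10, 10), sharex=True, sharey=True)

# Remove the horizontal and vertical gaps between the panels
fig.subplots_adjust(wspace=0, hspace=0)

for ax in axes.flat:
    ax.scatter([10, 20, 30], [15, 40, 90])   # hypothetical placeholder points
    ax.label_outer()                         # keep tick labels only on the outer edges
    ax.grid()

fig.savefig("multiplot_no_gaps.png")         # hypothetical output file name
```

With no space between the panels, per-panel titles would overlap the plot above, so a short text label inside each panel may work better than `set_title`.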