Data Visualization

Author

Neha Moopen & Jelle Treep

Published

October 5, 2023

Note
  • Date: Thursday, 5th October 2023
  • Time: 15:00-17:00
  • Location: Living Lab, University Library (Utrecht Science Park)
  • Presentation: Data Visualization
  • Presenters: Neha Moopen & Jelle Treep

Borrowing from the #TidyTuesday initiative, we’re analyzing data from Pet Cats UK!

In this edition of the Programming Cafe, we decided to try a coding challenge focused on data visualization that was accessible to all levels of experience.

Slides

Exercises

Once you have gotten hold of the data, you can attempt to reproduce some of the example visualizations (beginner, intermediate, advanced) we’ve gathered below. You can use whatever programming language and packages/libraries you like and we encourage you to work in pairs/groups - so you can learn together!

Resources

Inspiration

from Data to Viz

R Packages

  • ggplot2
  • leaflet
  • move
  • moveVis

Python Libraries

  • matplotlib
  • geopandas
  • movingpandas
  • (pystac + rioxarray) (to read and visualize sentinel satellite data)

Pet Cats UK

source: https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-01-31/readme.md

The data comes from the Movebank for Animal Tracking Data via Data is Plural.

Between 2013 and 2017, Roland Kays et al. convinced hundreds of volunteers in the U.S., U.K., Australia, and New Zealand to strap GPS sensors on their pet cats. The aforelinked datasets include each cat’s characteristics (such as age, sex, neuter status, hunting habits) and time-stamped GPS pings.

When using this dataset, please cite the original article.

Kays R, Dunn RR, Parsons AW, Mcdonald B, Perkins T, Powers S, Shell L, McDonald JL, Cole H, Kikillus H, Woods L, Tindle H, Roetman P (2020) The small home ranges and large local ecological impacts of pet cats. Animal Conservation. doi:10.1111/acv.12563

Additionally, please cite the Movebank data package:

McDonald JL, Cole H (2020) Data from: The small home ranges and large local ecological impacts of pet cats [United Kingdom]. Movebank Data Repository. doi:10.5441/001/1.pf315732

Additional datasets for the US, Australia, and New Zealand are also available for download, but they are not included.

Get the data here

R
library(readr)

cats_uk <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-01-31/cats_uk.csv')
cats_uk_reference <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-01-31/cats_uk_reference.csv')
Python
import pandas as pd

cats_uk = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-01-31/cats_uk.csv')
cats_uk_reference = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-01-31/cats_uk_reference.csv')

Data Dictionary

Full dictionaries are available on Movebank

cats_uk.csv
variable class description
tag_id character A unique identifier for the tag, provided by the data owner. If the data owner does not provide a tag ID, an internal Movebank tag identifier may sometimes be shown.
event_id double An identifier for the set of values associated with each event, i.e. sensor measurement. A unique event ID is assigned to every time-location or other time-measurement record in Movebank. If multiple measurements are included within a single row of a data file, they will share an event ID. If users import the same sensor measurement to Movebank multiple times, a separate event ID will be assigned to each.
visible logical Determines whether an event is visible on the Movebank map. Values are calculated automatically, with TRUE indicating the event has not been flagged as an outlier by algorithm_marked_outlier, import_marked_outlier or manually_marked_outlier, or that the user has overridden the results of these outlier attributes using manually_marked_valid = TRUE. Allowed values are TRUE or FALSE.
timestamp double The date and time corresponding to a sensor measurement or an estimate derived from sensor measurements.
location_long double The geographic longitude of the location as estimated by the sensor. Positive values are east of the Greenwich Meridian, negative values are west of it.
location_lat double The geographic longitude of the location as estimated by the sensor. Positive values are east of the Greenwich Meridian, negative values are west of it.
ground_speed double The estimated ground speed provided by the sensor or calculated between consecutive locations. Units are reportedly m/s, which indicates that there is likely a problem with this data (either the units were reported erroneously or their is an issue with the sensor data).
height_above_ellipsoid double The estimated height above the ellipsoid, typically estimated by the tag. Units: meters
algorithm_marked_outlier logical Identifies events marked as outliers using a user-selected filter algorithm in Movebank. Outliers have the value TRUE.
manually_marked_outlier logical Identifies events flagged manually as outliers, typically using the Event Editor in Movebank, and may also include outliers identified using other methods. Outliers have the value TRUE.
study_name character The name of the study in Movebank.
cats_uk_reference.csv
variable class description
tag_id character A unique identifier for the tag, provided by the data owner. If the data owner does not provide a tag ID, an internal Movebank tag identifier may sometimes be shown.
animal_id character An individual identifier for the animal, provided by the data owner. If the data owner does not provide an Animal ID, an internal Movebank animal identifier is sometimes shown.
animal_taxon character The scientific name of the species on which the tag was deployed, as defined by the Integrated Taxonomic Information System (ITIS, www.itis.gov). If the species name can not be provided, this should be the lowest level taxonomic rank that can be determined and that is used in the ITIS taxonomy.
deploy_on_date double The timestamp when the tag deployment started.
deploy_off_date double The timestamp when the tag deployment ended.
hunt logical Whether the animal was allowed to hunt.
prey_p_month double Approximate number of prey caught by the animal per month.
animal_reproductive_condition character The reproductive condition of the animal at the beginning of the deployment.
animal_sex character The sex of the animal, as “m” or “f”.
hrs_indoors double The average number of hours which the animal was indoors per day.
n_cats double The number of cats in the house.
food_dry logical Whether the cat was fed dry food.
food_wet logical Whether the cat was fed wet food.
food_other logical Whether the cat was fed other food.
study_site character A location such as the deployment site or colony, or a location-related group such as the herd or pack name.
age_years double The age of the animal at the beginning of the deployment, in years. “0” indicates that the animal was < 1 year old.

Example Data Visualizations

Beginner

Beginners can try:

  1. Reading in the data

  2. Plotting the number of cats per household

  1. Plotting the number of prey caught per year

  1. Plotting the number of hours spent indoors

  1. Plotting the gender of the cats in the dataset

  • Plotting the age of the cats

The above plots are referenced from: https://github.com/MattHondrakis/TidyTuesday/blob/main/2023/01-31-23/UK-Cats.md. The author has code snippets in R showing how they obtained the plots.

You can get tips on building bar plots in both R & Python on the following page: https://www.data-to-viz.com/graph/histogram.html

Intermediate

Intermediate users have two options:

Creating a treemap of the data.

The above plot is referenced from: https://github.com/poncest/tidytuesday/tree/main/2023/Week_05. The author has a script in R showing how they obtained the plot.

You can get tips on building treemaps in both R & Python on the following page: https://www.data-to-viz.com/graph/treemap.html

Here is a human-readable breakdown of the steps, provided by ChatGPT:

  1. Load Packages & Setup:
    • Load necessary software packages using the pacman library.
    • Set up the figure size and resolution for plotting.
  2. Read in the Data:
    • Load data for the TidyTuesday challenge for 2023, week 05.
    • Clean and organize the data.
  3. Examine the Data:
    • Use glimpse to see the structure of the cats_uk and cats_uk_reference datasets.
    • List unique values in the “tag_id” column of both datasets.
    • Combine the data using a left join operation and save memory.
  4. Tidy the Data:
    • Remove outliers from the data.
    • Convert timestamp data to date format.
    • Group and summarize the data based on specific columns.
    • Clean and prepare the data for visualization.
  5. Create Visualization:
    • Set up plot aesthetics including colors, titles, and captions.
    • Define fonts for text elements.
    • Build the main plot using the ggplot2 package with a treemap representation.
    • Customize plot labels, colors, and themes to make it visually appealing.
Map

Try to create a map (e.g. leaflet (R) or geopandas (python)) and plot a gps track on top of it

Advanced

Advanced users can try plotting the data onto a map and adding an animation element.

The above plot is referenced from: https://github.com/fblpalmeira/cats_uk/blob/main/README.md. The author has a script in R showing how they obtained the plot.

You can get tips on building maps in both R & Python on the following page: https://www.data-to-viz.com/graph/map.html

Here is a human-readable breakdown of the steps, provided by ChatGPT:

  1. Read in the Data: Begin by reading two CSV files from the internet. You can do this using software like R. The files you need are named cats_uk.csv and cats_uk_reference.csv.

  2. View the Data: After reading the data, use a tool or function (like the View function in R) to visually inspect the contents of both data files. This step helps you understand what’s in the data.

  3. Load Libraries: Make sure you have certain software libraries installed and loaded. These libraries include dplyr, move, moveVis, maptools, magrittr, and ggplot2. They provide tools for data manipulation and visualization.

  4. Combine and Filter Data: Use the dplyr library to combine the data from the two CSV files based on a common column called “tag_id.” After combining, filter the data to select only specific cat names like “Bear,” “Coco,” etc.

  5. Convert Timestamps: Convert the timestamps in the filtered data from a text format to a date-time format. This makes it easier to work with time-based data.

  6. Create a Move Object: Using the filtered and timestamp-converted data, create a “move object” that represents the movement of cats. This object specifies the columns for longitude, latitude, time, and track ID.

  7. Align Data: Adjust the timestamps in the move object to align them at a specific resolution, such as daily intervals. This step aggregates the data for each day.

  8. Generate Map Frames: Create map frames to visualize cat movements. You can customize these frames by adding things like labels, a north arrow, and a scale bar.

  9. Add Text: Include text on the map frames, such as a Twitter handle, at a specific location.

  10. Preview a Frame: Optionally, you can preview one of the map frames to see how it looks.

  11. Generate an Animation: Finally, use a function or tool (like animate_frames) to turn the map frames into an animated GIF. This GIF will show the movements of the selected cats over time.

References

Will be updated to be more neat, the authors of the data and code and resources retain all their copyright - they are awesome!