Preparation

In the next session, we will finally dive into text mining with R! You are required to complete the installation and preparation described on this page prior to attending the session. We will not have time to run through these steps during the meeting.

If you get stuck on any of the steps below, don’t hesitate to reach out to the instructors during the Walk-In Hours of Research Data Management Support. The Walk-In Hours take place every Monday from 15:00 to 17:00 at the University Library in the Science Park. One instructor will also be available at the University Library in the city center (in the seating area near the Digital Humanities Lab), and you are welcome to request an online meeting (via MS Teams) during these hours as well.

You can also contact the course coordinator, Neha Moopen, by email at n.moopen@uu.nl

Install & Load Packages

Let’s make sure you have all the required packages installed and that loading them in RStudio works.

These are two distinct operations. Installing means downloading a package’s files and setting them up on your computer; this is usually a one-time operation. Loading a package means making its functions and other features available in your current R session.

Loading is something you need to do every time you start (or restart) your R session. To install and load packages we use the R functions install.packages() and library(), respectively. Remember that install.packages() requires the package name to be given as a quoted string, while library() accepts the name either with or without quotes.
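As a small illustration of the quoting rule, the sketch below uses tools, a package that ships with R, so it runs before anything new is installed. The choice of tools is only an assumption made for this example.

```r
# library() accepts the package name with or without quotes.
library(tools)     # unquoted name
library("tools")   # quoted name, same effect

# Both calls attach the same package, which now appears on the search path.
tools_attached <- "package:tools" %in% search()
print(tools_attached)

# install.packages(), by contrast, needs a character string:
# install.packages("tidytext")   # works
# install.packages(tidytext)     # fails: R looks for an object named tidytext
```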

Let’s install the packages:

install.packages("tidyverse")
install.packages("tidytext")
install.packages("wordcloud")
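If you come back to this page later, you may prefer not to reinstall packages you already have. The following is a minimal sketch of one way to install only what is missing; the helper name find_missing and the interactive() guard are assumptions of this example, not part of the course material.

```r
# Packages this page asks you to install.
needed <- c("tidyverse", "tidytext", "wordcloud")

# Hypothetical helper: return the packages from `pkgs` that are not installed yet.
find_missing <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- find_missing(needed)
print(to_install)

# Install only what is missing. The interactive() guard (an assumption of this
# sketch) avoids downloads when the code runs in a non-interactive script.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

An empty result (character(0)) means all three packages are already installed.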

And load them:

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.2.3
Warning: package 'ggplot2' was built under R version 4.2.3
Warning: package 'tibble' was built under R version 4.2.3
Warning: package 'tidyr' was built under R version 4.2.3
Warning: package 'readr' was built under R version 4.2.3
Warning: package 'purrr' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
Warning: package 'forcats' was built under R version 4.2.3
Warning: package 'lubridate' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
Warning: package 'tidytext' was built under R version 4.2.3
library(wordcloud)
Warning: package 'wordcloud' was built under R version 4.2.3
Loading required package: RColorBrewer
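The warnings and messages above are informational: they do not mean that loading failed. If you want to confirm that the packages are attached, a check along these lines may help. It is only a sketch, and the helper name is_attached is made up for this example.

```r
# Hypothetical helper: report which of the given packages are attached
# in the current session.
is_attached <- function(pkgs) {
  attached <- pkgs %in% .packages()
  names(attached) <- pkgs
  attached
}

# TRUE for a package means it is attached in this session.
status <- is_attached(c("tidyverse", "tidytext", "wordcloud"))
print(status)
```

A FALSE entry suggests that the corresponding library() call was not run in this session or ended with an error.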

A few words about the packages we are going to use:
  • tidyverse: this is an “opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures”. Among the many tidyverse packages, we are going to use in particular:
    • readr: for reading data from files;
    • dplyr: for data manipulation;
    • tidyr: for making your data “tidy”;
    • forcats: for working with factors, an advanced R topic we will not spend much time on;
    • ggplot2: a system for creating graphics.
  • tidytext: an R package for text mining based on the tidy data principles;
  • wordcloud: a package to generate word cloud plots.

Project Organization

1. Create a project in RStudio

  • In RStudio, click File -> New Project -> New Directory -> New Project.

  • Give your project directory the following name: text-mining-in-r

  • Make sure your project directory is created (as a subdirectory) in an accessible place on your system.

  • Select Open project in a new session.

2. Create a project structure suited for reproducible work

  • You can generate a directory structure by running the following piece of code in your R console:

# Create the project folders inside the current working directory,
# which is the project root when the RStudio project is open.
dir.create("data", recursive = TRUE)
dir.create("lexicons", recursive = TRUE)
dir.create("docs", recursive = TRUE)
dir.create("results", recursive = TRUE)
dir.create("R", recursive = TRUE)
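The same structure can also be created with a loop and then verified. The sketch below wraps this in a helper called create_project_dirs, a name invented for this example, and demonstrates it in a temporary directory so that nothing in your project is touched.

```r
# Hypothetical helper: create the course folders under `root` and report
# which of them exist afterwards.
create_project_dirs <- function(root = ".",
                                folders = c("data", "lexicons", "docs", "results", "R")) {
  paths <- file.path(root, folders)
  for (path in paths) {
    # showWarnings = FALSE keeps the call quiet if the folder already exists;
    # recursive = TRUE also creates `root` itself when it is missing.
    dir.create(path, showWarnings = FALSE, recursive = TRUE)
  }
  present <- dir.exists(paths)
  names(present) <- folders
  present
}

# Demonstration in a temporary directory.
demo_root <- file.path(tempdir(), "text-mining-in-r-demo")
created <- create_project_dirs(demo_root)
print(created)
```

To use it for real, call create_project_dirs() with no arguments from the console of your text-mining-in-r project. Running it again is harmless, because existing folders are left alone.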

Download Data

Download the data file linked below (right-click and select Save link as…) and place it in the data folder you just created:

ianalyzer_query.csv

Download Lexicon

Download the lexicon file linked below (right-click and select Save link as…) and place it in the lexicons folder:

NRC_lexicon.txt
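Once both downloads are in place, you can check from the R console that the files ended up in the right folders. This is a minimal sketch that assumes your RStudio project is open, so that relative paths start at the project folder; check_files is a hypothetical helper name.

```r
# Relative paths as described on this page; they are resolved against the
# working directory, which is the project folder when the project is open.
expected_files <- c(
  file.path("data", "ianalyzer_query.csv"),
  file.path("lexicons", "NRC_lexicon.txt")
)

# Hypothetical helper: TRUE for each file that is present, FALSE otherwise.
check_files <- function(paths) {
  found <- file.exists(paths)
  names(found) <- paths
  found
}

found <- check_files(expected_files)
print(found)

# Name the files that still need to be downloaded or moved.
if (!all(found)) {
  message("Still missing: ", paste(names(found)[!found], collapse = ", "))
}
```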

Download the R Markdown Files

Download the following Rmd files (right-click and select Save link as…) and place them in the R folder you just created:

Some Reading

Read the section Just before starting: tidyverse pipelines on the next page. It’s alright if you don’t understand everything, but it will help you become more familiar with the syntax we will be using in the text mining session.
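To give a first impression of what a pipeline looks like, here is a small sketch. It uses R’s built-in |> operator (available from R 4.1) rather than the tidyverse %>% operator, so that it runs without any packages; simple chains like this one are written the same way with either operator. The word vector is made up for the example.

```r
# A made-up vector of words, for illustration only.
words <- c("text", "mining", "with", "r", "text")

# Nested calls are read from the inside out.
nested <- sort(table(toupper(words)), decreasing = TRUE)

# The same steps as a pipeline are read from left to right:
# uppercase the words, count them, then sort by frequency.
piped <- words |>
  toupper() |>
  table() |>
  sort(decreasing = TRUE)

# Both forms produce the same result.
print(identical(nested, piped))
print(piped)
```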

Bonus Reading

The Text Mining with R textbook is our go-to reference for the upcoming session. Take some time to skim through the case studies presented in the book:

Again, it’s alright if you don’t understand everything. The aim is to familiarize you with the techniques we will be trying out in the next session and to show how the results of text mining techniques are presented and interpreted.