Welcome!

All course materials at tinyurl.com/introRDatasite

Welcome! Who are you?

  • What part of your education are you in? (Bachelor, PhD, prof…)

  • What is your faculty/background? (Economics, Medicine, Biology…)

  • What is your motivation for learning R?

  • What is your experience with R?

This morning’s schedule

  •   9:30   Introductions

  • 10:00   Base R + Exercises 1-6

  • 11:25   Recap & questions

  • 11:30   Coffee break

  • 11:45   Programming + Exercises 7-9

  • 12:45   Lunch break

  • 13:30   Reconvene for afternoon program

Introduction to R & Data

Part 1: Basics of R

What is R

  • A widely used programming language for data analysis
  • Based on statistical programming language S (1976)
  • Developed by Ross Ihaka & Robert Gentleman (1995)
  • Very active community, with many (often subject-specific) packages
  • Open source, and interoperable!

We will work in RStudio

  • Integrated Development Environment (IDE) for R
  • Founded by J.J. Allaire, available since 2010
  • Bloody useful! Let’s take a look: please open RStudio!

The RStudio interface

Course materials

All course materials, videos, information & resources are at tinyurl.com/introRDatasite.

  1. Download the course materials.

  2. Store them in a local (i.e. not on a mounted drive), accessible location.

  3. Unzip the download to create a single folder. What animal is displayed on animal.png?

  4. Double-click the course-materials.Rproj file. Or: Go to File > Open Project > select course-materials.Rproj > Open

  5. From the ‘Files’ menu (bottom right), click baseR_exercises.Rmd.

Running R code

In the baseR_exercises.Rmd:

You can execute the Exercise 0 chunk as a whole with the green triangle at the top right of the chunk.

In a script/Rmarkdown document

  • Place your cursor in the line of code you want to execute
  • Press ‘Run’ or ctrl + enter
  • When running multiple lines: select all lines, then press ‘Run’ or ctrl + enter

R syntax & data types

Variable assignment

You can assign both numbers and text to a variable:

x <- 6
x <- 'apple'
x <- "hello world"

You will see your variable (R object) appear in your Environment (top right panel).

See the cheat sheet in the cheatsheets folder, or download it.

Do you expect an answer?

Saving information as an R object:

x <- 1

Asking for information to be returned:

x
[1] 1

Note the difference in syntax:

  • <- operator: storing information = no immediate ‘answer’
  • calling up an object, or making calculations: R shows you the answer

Maths functions

You can perform math with your variables:

x * 3
[1] 3

and store the results as new variables:

y <- x + 2

log2(y)
[1] 1.584963

Check “Maths Functions” on the Base R cheatsheet:

Logicals

A logical is TRUE or FALSE, and can also be written as T or F.

Logicals are mostly used as tests:

==   is equal to
!=   is not equal to
>=   larger than or equal to
<    smaller than

For example:

x == 6
[1] FALSE
x != 10
[1] TRUE
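
The remaining comparison operators work the same way. A minimal sketch, assuming x still holds the value 1 from the earlier slides (is_small is a name made up for this example):

```r
x <- 1            # same value as on the earlier slides

x > 0             # larger than: TRUE
x <= 1            # smaller than or equal to: TRUE
x >= 6            # larger than or equal to: FALSE

# The result of a test is itself a logical that can be stored
is_small <- x < 10
is_small          # TRUE
```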

Go to exercise 1 in baseR_exercises.Rmd

Answers to exercise 1

  1. Do the following calculation in R: 1 plus 5, divided by 9
(1+5)/9
[1] 0.6666667
  2. Assign the result of the calculation to a variable.
x <- (1+5)/9
  3. Test if the result is larger than 1.
x > 1
[1] FALSE
  4. Round off the result to 1 decimal.
round(x,1)
[1] 0.7

Vectors in R

Combining data: creating vectors

Vectors are created with the function c()

A numeric vector:

c(1,2,3)
[1] 1 2 3

A character vector:

c("a","b","c")
[1] "a" "b" "c"

A logical vector:

c(T,TRUE,F)
[1]  TRUE  TRUE FALSE

Combining data: creating vectors

What is this vector?

c(TRUE,"a",3)
[1] "TRUE" "a"    "3"   

Yep, a character vector!

Vector type defaults to the “lowest common denominator”: everything can be a character, but not everything can be a number or a logical.

Order:

  1. Character
  2. Numeric
  3. Logical
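
This order can be checked with class(). A small sketch (mixed is a name chosen for illustration only):

```r
# Mixing types in c() converts every element to the most general type
class(c(TRUE, 3))        # "numeric": TRUE becomes 1
class(c(3, "a"))         # "character": 3 becomes "3"
class(c(TRUE, "a", 3))   # "character"

mixed <- c(TRUE, 3)
mixed                    # 1 3
```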

Vector functions

Vectors can be used in mathematical operations

p <- 1:5
p
[1] 1 2 3 4 5
mean(p)
[1] 3
p * 2
[1]  2  4  6  8 10
p   2   p * 2
1   2   2
2   2   4
3   2   6
4   2   8
5   2   10

Vector functions

Operations with multiple vectors are performed by aligning the index

q <- 5:1
q
[1] 5 4 3 2 1
p * q
[1] 5 8 9 8 5
p   q   p * q
1   5   5
2   4   8
3   3   9
4   2   8
5   1   5
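
The same index-by-index alignment applies to other operators. A short sketch using the p and q vectors from this slide:

```r
p <- 1:5
q <- 5:1

p + q        # 6 6 6 6 6
p > q        # FALSE FALSE FALSE  TRUE  TRUE
sum(p * q)   # 35: the element-wise products added together
```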

Go to exercise 2

Answers to exercise 2

  1. Meet Ann, Bob, Chloe, and Dan. Create a character vector called “name” and add these names to the vector using the c() function.
name <- c("Ann", "Bob", "Chloe", "Dan")
  2. How old are Ann, Bob, Chloe, and Dan? Create a numeric vector called “age” and add their ages to the vector. You can decide their ages.
age <- c(35,22,50,51)
  3. Use the class() function to check the data type of the name and age vectors.
class(name)
[1] "character"
class(age)
[1] "numeric"

Answers to exercise 2 (continued)

  4. What is their average age? Use a function in R to calculate this. Tip: use the Maths Functions section of the Base R cheat sheet!
mean(age)
[1] 39.5

Data structures

Data structures: vector

We have two vectors: name and age

name
[1] "Ann"   "Bob"   "Chloe" "Dan"  
age
[1] 35 22 50 51

How do we combine them?

Into a one-dimensional vector: c()

c(name,age)
[1] "Ann"   "Bob"   "Chloe" "Dan"   "35"    "22"    "50"    "51"   

Data structures: data frame

How about combining name and age in a two-dimensional table structure?

data.frame(name,age)
   name age
1   Ann  35
2   Bob  22
3 Chloe  50
4   Dan  51

Two dimensions:

  • Rows
  • Columns
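
The two dimensions can be inspected directly. A minimal sketch, where people is a name chosen for this example so it does not clash with the df created later:

```r
name <- c("Ann", "Bob", "Chloe", "Dan")
age <- c(35, 22, 50, 51)
people <- data.frame(name, age)

nrow(people)       # 4 rows
ncol(people)       # 2 columns
dim(people)        # 4 2 (rows first, then columns)
colnames(people)   # "name" "age"
```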

Data structures: list

Or: in a multi-dimensional list.

list(name,age)
[[1]]
[1] "Ann"   "Bob"   "Chloe" "Dan"  

[[2]]
[1] 35 22 50 51

Dimensions: any

Lists can contain any R object. Not just dataframes and vectors, but also other lists.
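
A small sketch of a list that holds a vector, a data frame, and another list. The name nested is hypothetical, and the [[ ]] notation used to look inside is explained in the indexing part of the course:

```r
name <- c("Ann", "Bob", "Chloe", "Dan")
age <- c(35, 22, 50, 51)

# 'nested' is a hypothetical name chosen for this sketch
nested <- list(name, data.frame(name, age), list(age, TRUE))

length(nested)       # 3 elements
class(nested[[2]])   # "data.frame"
class(nested[[3]])   # "list": a list inside a list
```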

Data structures: summary

 

             number of dimensions   function
vector       1                      c()
data frame   2                      data.frame()
list         any number             list()

NB: dataframes and lists appear under Data in the Environment (top right panel in RStudio), vectors under Values.

Factors

A factor is a special type of vector, defined by its levels. It is usually used as a categorical variable in a data frame.

# Create vector country
country <- c("UK","USA","USA","UK")
country
[1] "UK"  "USA" "USA" "UK" 
# Turn country into a factor
country_fac <- as.factor(country)
country_fac
[1] UK  USA USA UK 
Levels: UK USA
df <- data.frame(name, age, country_fac)
df
   name age country_fac
1   Ann  35          UK
2   Bob  22         USA
3 Chloe  50         USA
4   Dan  51          UK
summary(df)
     name                age        country_fac
 Length:4           Min.   :22.00   UK :2      
 Class :character   1st Qu.:31.75   USA:2      
 Mode  :character   Median :42.50              
                    Mean   :39.50              
                    3rd Qu.:50.25              
                    Max.   :51.00              
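
To look at the levels themselves, a few base R functions can help. A sketch using the same country vector as above:

```r
country <- c("UK", "USA", "USA", "UK")
country_fac <- as.factor(country)

levels(country_fac)       # "UK" "USA"
nlevels(country_fac)      # 2
as.integer(country_fac)   # 1 2 2 1: each value is stored as a level number
table(country_fac)        # counts per level: UK 2, USA 2
```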

Go to exercise 3

Answers to exercise 3

  1. Create a vector called “country” containing four countries of your choice.
country <- c("UK", "US", "NL", "BE")
  2. Create a data frame called “df” combining name, age, and country.
df <- data.frame(name, age, country)
  3. Create a list called “mylist” with the 3 vectors and 1 dataframe you have just created.
mylist <- list(name, age, country, df)

Indexing vectors & lists

Selecting vector elements
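
Elements are selected with square brackets. A minimal sketch of the most common patterns, using the name and age vectors from exercise 2:

```r
name <- c("Ann", "Bob", "Chloe", "Dan")
age <- c(35, 22, 50, 51)

age[3]            # by position: 50
age[2:3]          # a range of positions: 22 50
name[c(1, 3)]     # several positions: "Ann" "Chloe"
age[-1]           # everything except the first element: 22 50 51
age[age > 40]     # by condition: 50 51
name[age > 40]    # a condition on one vector can select from another: "Chloe" "Dan"
```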

Go to exercise 4

Answers to exercise 4

  1. Return only the first number in your vector age.
age[1]
[1] 35
  2. Return the 2nd and 4th name in your vector name.
name[c(2,4)]
[1] "Bob" "Dan"
  3. Return only ages under 30 from your vector age.
age[age<30]
[1] 22

Indexing a data frame


Indexing columns

By position:

df[,2]
[1] 35 22 50 51

By name (as a character string):

df[,"age"]
[1] 35 22 50 51

By name (as an object):

df$age
[1] 35 22 50 51

Indexing rows

By position:

df[2,]
  name age country
2  Bob  22      US

By content:

df[df$name=="Bob",]
  name age country
2  Bob  22      US

Combining rows and columns

df[df$name=="Bob","age"]
[1] 22

Go to exercise 5

Answers to exercise 5

  1. From your dataframe df, return complete rows for everyone living in a country of your choice.
df[df$country=="UK", ]
  name age country
1  Ann  35      UK
  2. Return only the names of everyone in your data frame df under 40.
df[df$age<40, "name"]
[1] "Ann" "Bob"
  3. Return the columns name and age together.
df[, c("name","age")]
   name age
1   Ann  35
2   Bob  22
3 Chloe  50
4   Dan  51

Selecting from a list

Selecting a list element from mylist:

mylist[1]
[[1]]
[1] "Ann"   "Bob"   "Chloe" "Dan"  

Selecting the content of a list element:

mylist[[1]]
[1] "Ann"   "Bob"   "Chloe" "Dan"  

Subselection in the content of a list element:

mylist[[4]][2]
  age
1  35
2  22
3  50
4  51
mylist[[4]][1,2]
[1] 35
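
The difference between single and double brackets shows up when you ask for the class of the result. A sketch that rebuilds mylist as in exercise 3:

```r
name <- c("Ann", "Bob", "Chloe", "Dan")
age <- c(35, 22, 50, 51)
country <- c("UK", "US", "NL", "BE")
df <- data.frame(name, age, country)
mylist <- list(name, age, country, df)

class(mylist[1])      # "list": still a list, holding one element
class(mylist[[1]])    # "character": the vector itself
mylist[[1]][3]        # "Chloe"
mylist[[4]]$age       # 35 22 50 51
```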

Missing data

Not Available (NA)

Let’s add a column to our data:

df$pet <- c("cat","none","",NA)

df
   name age country  pet
1   Ann  35      UK  cat
2   Bob  22      US none
3 Chloe  50      NL     
4   Dan  51      BE <NA>

Notice that:

  • we know that Bob has no pets.
  • we do not know if Dan has pets.
  • the value for Chloe is empty.

Predict the answer (see Exercise 6)

5 == 5
[1] TRUE
5 == NA
[1] NA
NA == NA
[1] NA
is.na(NA)
[1] TRUE

So: want to test if a value is NA? Use is.na()!
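
A short sketch of how NA behaves in a vector and in a calculation. The pet values match the column added above; ages_with_gap is a made-up vector for illustration:

```r
pet <- c("cat", "none", "", NA)

is.na(pet)         # FALSE FALSE FALSE  TRUE
pet == ""          # FALSE FALSE  TRUE    NA: comparing with NA gives NA
sum(is.na(pet))    # 1 missing value

# Calculations return NA as soon as one value is missing
ages_with_gap <- c(35, NA, 50)
mean(ages_with_gap)                 # NA
mean(ages_with_gap, na.rm = TRUE)   # 42.5: NA values are left out
```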

NULL: data does not exist

Do we know about our participants’ jobs?

# Select the column "job" from df
df$job
NULL
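
NULL can be tested for with is.null(). A minimal sketch, assuming a small stand-in data frame called people (a hypothetical name) that has no job column:

```r
people <- data.frame(name = c("Ann", "Bob"), age = c(35, 22))

is.null(people$job)   # TRUE: the column does not exist
is.null(people$age)   # FALSE: the column exists

# NULL has no length at all, while NA is one (unknown) value
length(NULL)          # 0
length(NA)            # 1
```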

NA          Information is Not Available
NULL        Information does not exist
none or 0   Data entry specifying content of 0
""          Empty character value

Programming: if statements

If statement: a conditional

An if statement tests if a condition is TRUE or FALSE and executes code depending on the outcome of that test.

If statement in R

  • To build an if-statement, start with the function if():

    if()
  • Within the (), insert the condition you want to test for:

    if(number > 10)
  • Within the {}, insert the code that should be executed if the condition is met:

    if(number > 10) {
        test_result <- "number is greater than 10"
    }
  • You can expand the statement with else {} if the condition is not met.

    if(number > 10) {
        test_result <- "number is greater than 10"
    } else {
        test_result <- "number is not greater than 10"
    }
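
To run the statement above, number needs a value first. A minimal runnable sketch, assuming the value 12 purely for illustration:

```r
number <- 12              # example value, chosen for this sketch

number > 10               # the condition on its own: TRUE

if (number > 10) {
  test_result <- "number is greater than 10"
} else {
  test_result <- "number is not greater than 10"
}

test_result               # "number is greater than 10"
```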

Go to exercise 7

Answers to exercise 7

Make an if statement that tests if a number is 18 or larger. Assign the result to the variable age_category.

number <- 8

if(number >= 18){
  age_category <- "adult"
} else{
  age_category <- "minor"
}

print(age_category)
[1] "minor"

Programming: functions

Functions: a sequence

A function consists of one or more instructions that form a cohesive unit.

A function can be repeated on different inputs:

mean(df$age)
[1] 39.5
mean(1:100)
[1] 50.5

Functions

Functions can also be used to make a complex line of code easier to write/read:

You write the function once:

find_bobs_age <- function(data){
  bobs_age <- data[data$name == "Bob", "age"]
  return(bobs_age)
}

Now, every time you want to find Bob’s age you use:

find_bobs_age(df)
[1] 22

Functions are the bread and butter of programming!

A good script will consist mostly of functions, with a minimal amount of code that applies the functions.
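
Adding a second argument makes such a function reusable for any person. A sketch, where find_age and person are hypothetical names and df is rebuilt with the ages used in the exercises:

```r
df <- data.frame(name = c("Ann", "Bob", "Chloe", "Dan"),
                 age = c(35, 22, 50, 51))

# 'find_age' generalises find_bobs_age: the name to look up is an argument
find_age <- function(data, person){
  persons_age <- data[data$name == person, "age"]
  return(persons_age)
}

find_age(df, "Bob")     # 22
find_age(df, "Chloe")   # 50
```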

Functions in R

  • To make a function, use the function function():

    myFun <- function()
  • You assign names to the user’s input in the function’s arguments:

    myFun <- function(arg1, arg2)
  • The sequence of operations is in the body of the function (between { }):

    myFun <- function(arg1, arg2){
        multiplication <- arg1 * arg2
    }
  • The output of the function is placed in a return statement:

    myFun <- function(arg1, arg2){
        multiplication <- arg1 * arg2
        return(multiplication)
    }

Using your own function

First, run the code with the function itself. It will appear in your environment:

Now, you can use the function:

myFun(3,4)
[1] 12
myFun(90,71)
[1] 6390

Go to exercise 8

Answers to exercise 8

Turn the if-statement from the last exercise into a function. Let the user provide the value for number, and return the age_category.

test_age <- function(number){
  if(number >= 18){
    age_category <- "adult"
  } else{
    age_category <- "minor"
  }
  return(age_category)
}
# Test the function
test_age(20)
[1] "adult"
test_age(2)
[1] "minor"

Programming: loops

A for-loop

  • Perform the same action(s) for multiple inputs at a time
  • Input is an object with multiple similar elements (e.g., a vector, row in a dataframe, etc.) that can be iterated over

A for-loop in R

  • A loop starts with the iterable object (in this case the vector 1:5), and the temporary name for each item (in this case a_number):

    for(a_number in 1:5)
  • Within { }, you place the instructions:

    for(a_number in 1:5){
      print(a_number)
    }
    [1] 1
    [1] 2
    [1] 3
    [1] 4
    [1] 5

Note that a_number is 1 in the first iteration of the loop, 2 in the second, etc. It is created by the loop itself, and after the loop finishes it keeps its last value (5).
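
A loop can also build up a result step by step, or iterate over text. A small sketch (total and person are names chosen for this example):

```r
# Add up the numbers 1 to 5, one iteration at a time
total <- 0
for(a_number in 1:5){
  total <- total + a_number
}
total   # 15

# The iterable object can be any vector, for example a character vector
for(person in c("Ann", "Bob")){
  print(paste("Hello", person))
}
```

The second loop prints "Hello Ann" and then "Hello Bob".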

Go to exercise 9

Answers to exercise 9

Go over the age column in your dataframe df, and for each age: print() the age category using the test_age function from the previous exercise.

print(df$age)
[1] 35 22 50 51
for(the_age in df$age){
  test <- test_age(the_age)
  print(test)
}
[1] "adult"
[1] "adult"
[1] "adult"
[1] "adult"

Recap “Basics of R”

Which bracket does what?

[ ]   Indexing vectors, lists, dataframes…

( )   Passing arguments to functions

{ }   Defining content of if-statements, functions, loops, etc.

You can speak R!

What data types have you encountered so far?

logical
numeric
character

How can data be missing?
NA (not available)
NULL (non-existent)
"" (empty)

What data structures have you encountered?
vector (one dimension)
data frame (two dimensions)
list (++ dimensions)

Functions

What functions have you encountered so far?

c()
data.frame()
is.na()
mean()
summary()

Programming basics

  • If-statements
  • Functions
  • For-loops

Help!

  • How does a function work? Type in your console:
    ?mean

  • Use a search engine (often useful: Stack Overflow)

  • (Generative AI)

    • Upside: can write all code for you
    • Downside: you learn less, can hallucinate, not certain what sources it uses

Help!

Need help? Scroll down on a function’s help page for examples!

Note on R projects

When you start programming for yourself:

  • Create a folder dedicated to your project
  • Start a new R project: File > New Project > Existing Directory
  • An .Rproj file will be created

Advantages:

  • Automatically set your working directory to that folder
  • Automatically retrieve only the history and objects from that R project
  • More reproducible (relative vs. absolute paths)
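
A sketch of what a relative path looks like in practice. The folder data and the file survey.csv are hypothetical names, not part of the course materials:

```r
# Inside an R project, the working directory is the project folder
getwd()

# A relative path is resolved starting from that folder.
# "data" and "survey.csv" are hypothetical names for this sketch.
data_path <- file.path("data", "survey.csv")
data_path                  # "data/survey.csv"

# Check whether the file is really there before trying to read it
file.exists(data_path)
```

Because the path does not mention your own computer’s folders, the same script can run unchanged for anyone who opens the project.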

Lunch break!