Show me the code
install.packages("fmsb")Copernicus institute of sustainable development, Utrecht University
Amsterdam sustainability institute, Vrije Universiteit Amsterdam
Departamento de Solos Centro de Ciências Agrárias, Universidade Federal de Vicosa, Brazil
In this tutorial, you will learn how to fit a logistic regression model in R, interpret the model coefficients, and calculate predicted probabilities. We will also explore how to evaluate the performance of the model and assess how well it explains the observed data.
Let’s get started!
In this tutorial, we will need to use functions in the fmsb package. Make sure to install this package before starting the tutorial.
To work on this tutorial, you will need to load the readxl, tidyverse, and fmsb packages.
Start by importing the data used in this course (see previous tutorials).
Logistic regression is a statistical method used to model the relationship between a response variable and one or more explanatory variables when the response variable follows a binomial distribution, such as binary variables (1 or 0, success or failure, etc.) or proportion data.
The logistic model describes how the probability of a binary outcome (e.g., presence/absence or success/failure) changes with one or more explanatory variables. Let (\(p\)) be the probability that the outcome equals 1 (e.g., probability that a species is present at a site, or probability of success). Instead of modeling this probability directly, the logistic model first expresses it as odds (Equation 1).
\[ odds=\frac{p}{1-p} \tag{1}\]
The odds describe how likely the event is to occur (\(p\)) compared to not occurring (\(1-p\)). For example, if an event has a probability of success (\(p\)) of 0.75, the odds are 3 (0.75/0.25), meaning the event is three times as likely to occur as not occur.
The logistic regression model then takes the logarithm of the odds (called logit) and models it as a linear function of explanatory variables (\(x_1, x_2, ..., x_n\)) (Equation 2).
\[ logit(p)=log(\frac{p}{1-p})=\beta_0+\beta_1x_1+\beta_2x_2+...+\beta_nx_n \tag{2}\]
This is exactly the same as writing:
\[ p=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\beta_2x_2+...+\beta_nx_n)}} \]
In this model, the coefficients (\(\beta\)), which are estimated from the data, describe how the explanatory variables affect the log-odds of the outcome.
In this tutorial, we will aim to model the relationship between pesticide use (binary variable: Yes or No) and the number of agroecological practices (n_practices).
We first need to convert the pesticide variable into a binary variable (1 if the farm uses pesticides, 0 if the farm does not use pesticides).
In data, create a new binary variable (pesticide2) that is 1 or farms that use pesticides and 0 for farms that do not. Use dplyr::mutate() to create this new variable. You will also need to use a relational operator (see first tutorial) to write a condition that will allow R to categorize farms based on pesticide use. The function if_else() is also needed here as you can use it to tell R to assign the value 1 when pesticide > 0, but 0 otherwise.
Let’s start by visualizing the relationship between pesticide use (pesticide2, y-axis) and the number of agroecological practices (n_practices).
Create a scatter plot of the relationship between the number of agroecological practices (n_practices, x-axis) and pesticide use (pesticide2, y-axis).
ggplot(data = data,
mapping = aes(x = n_practices,
y = pesticide2)) +
geom_point(shape=21,
fill = "#FFCD00", #Fill colour for shape
colour = "black")+ #Border colour for shape
xlab("Number of agroecological practices")+ #Change name of x axis
ylab("Pesticide use")+ #Change name of y axis
theme_bw() #This creates a black/white plotTo fit a logistic regression model in R, we can use the glm() function (generalized linear model). You will need to consider three main arguments to use this function:
formula: a formula specifying the model. The formula must have the following form: response ~ predictor.
data: a data frame in which the variables specified in the formula will be found.
family: the error distribution and link function to be used in the model. Default is set to family = gaussian, which assumes that residuals follow a normal distribution (model results will then be identical to lm()). To fit a logistic regression model, you need to use family=binomial(link="logit"). This tells R that the response variable follows a binomial distribution and that a logistic link function should be used.
glm() to model the relationship between the number of agroecological practices (n_practices, x-axis) and pesticide use(pesticide2, y-axis) using a logistic regression. Store the model into an R object called model.summary() on the model object and inspect the summary table. Make sure you understand the information contained in this output.
Call:
glm(formula = pesticide2 ~ n_practices, family = binomial(link = "logit"),
data = data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 11.9311 4.2124 2.832 0.00462 **
n_practices -0.6768 0.2399 -2.822 0.00478 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 47.092 on 35 degrees of freedom
Residual deviance: 20.274 on 34 degrees of freedom
AIC: 24.274
Number of Fisher Scoring iterations: 7
Both the slope and the intercept of the model are significantly different from zero (p < 0.01).
The equation of the model is as follows:
\[ logit(p)=log(\frac{p}{1-p})=11.93-0.68 \times x_1 \]
The same equation could be rewritten as:
\[ p=\frac{1}{1+e^{-(11.93-0.68 \times x_1)}} \]
From the equations above, we can figure out that for the odds of using pesticides get multiplied by \(e^{-0.68}=0.51\) every time that one agroecological practice is added.
After fitting a model (for example with glm()), the confidence intervals of model coefficients can be obtained by applying confint() to the fitted model object. By default, this function calculates 95% confidence intervals for each model coefficient.
For logistic regression, the coefficients are expressed in log-odds. If we take the exponential of the confidence intervals using exp(), we obtain confidence intervals for the odds ratios.
confint() to calculate the lower and upper bounds of the 95% confidence interval for model coefficients.The odds ratio is equal to \(e^{-0.68}=0.51\) (see exercise 3). The 95% confidence interval around this odds ratio is [0.27, 0.73]. Because the confidence interval of the odds ratio lies entirely below 1, increasing the number of agroecological practices decreases the odds of using pesticides. In other words, the relationship between the number of practices and pesticide use is negative: farms with more agroecological practices are less likely to use pesticides. Since the confidence interval does not include 1, this indicates that the effect is statistically significant.
The goodness of fit of a logistic regression model can be assessed using Nagelkerke’s R2, a commonly used pseudo-R2 measure for models with a binary response variable. Unlike the R2 used in linear regression, logistic regression does not have a direct measure of the proportion of variance explained. Nagelkerke’s R2 provides a comparable measure that ranges between 0 and 1, where higher values indicate a better model fit.
In R, Nagelkerke’s R2 can be calculated using the NagelkerkeR2() function from the fmsb package. You just need to specify the model object as an input.
Nagelkerke’s R2 compares the fitted model with a null model (a model without predictors). A larger value indicates that the explanatory variables improve the model’s ability to predict the response variable.
What is the probability (\(p\)) that a farm uses pesticides if it uses a total of 17 agroecological practices? Calculate this probability by hand as well as in R.
The equation of the model is as follows:
\[ logit(p)=log(\frac{p}{1-p})=11.93-0.68 \times x_1 \]
The same equation could be rewritten as:
\[ p=\frac{1}{1+e^{-(11.9311-0.6768 \times x_1)}} \]
Therefore, the requested probability (\(p\)) can be calculated as:
\[ p=\frac{1}{1+e^{-(11.9311-0.6768 \times 17)}}=0.60 \]
If a farm uses 17 agroecological practices, there is a 60% chance that it uses pesticides.
An easy way to visualize the results of a logistic regression model is to use geom_smooth() in ggplot2. This function can automatically fit a model and display the predicted relationship between the response variable and a predictor.
For a binary response variable, you can specify a logistic regression model by using method = "glm" and method.args = list(family = "binomial").
Improve the scatter plot from exercise 2 by adding a line showing the fitted logistic model. In geom_smooth(), set se=TRUE to display the 95% confidence interval. Now that you have more experience with ggplot2, try to write the code yourself and do not look at the answer right away.
ggplot(data = data,
mapping = aes(x = n_practices,
y = pesticide2)) +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = TRUE,
colour = "black")+
geom_point(shape=21,
fill = "#FFCD00", #Fill colour for shape
colour = "black")+ #Border colour for shape
xlab("Number of agroecological practices")+ #Change name of x axis
ylab("Pesticide use")+ #Change name of y axis
theme_bw() #This creates a black/white plot---
title: "Tutorial 11: Logistic regression"
author:
- name: Benjamin Delory
orcid: 0000-0002-1190-8060
email: b.m.m.delory@uu.nl
affiliations:
- name: Copernicus institute of sustainable development, Utrecht University
- name: Natalie Davis
orcid: 0000-0002-2678-0389
email: n.a.davis@vu.nl
affiliations:
- name: Amsterdam sustainability institute, Vrije Universiteit Amsterdam
- name: Heitor Mancini Teixeira
orcid: 0000-0001-6992-0671
email: heitor.teixeira@ufv.br
affiliations:
- name: Departamento de Solos Centro de Ciências Agrárias, Universidade Federal de Vicosa, Brazil
format: html
editor: visual
editor_options:
chunk_output_type: console
image: /Images/Rlogo.png
---
## About this tutorial
In this tutorial, you will learn how to fit a logistic regression model in R, interpret the model coefficients, and calculate predicted probabilities. We will also explore how to evaluate the performance of the model and assess how well it explains the observed data.
Let's get started!
## Install R packages
In this tutorial, we will need to use functions in the *fmsb* package. Make sure to install this package before starting the tutorial.
```{r}
#| eval: false
#| echo: true
#| message: false
#| warning: false
install.packages("fmsb")
```
## Load R packages
To work on this tutorial, you will need to load the *readxl*, *tidyverse*, and *fmsb* packages.
```{r}
#| eval: true
#| echo: true
#| message: false
#| warning: false
library(readxl)
library(tidyverse)
library(fmsb)
```
## Importing data
Start by importing the data used in this course (see previous tutorials).
```{r}
#| eval: true
#| echo: true
#| message: false
#| warning: false
#Option 1 (csv file)
data <- read_csv("gss_statistics_master_data_set2.csv")
#Option 2 (Excel file)
data <- read_excel("gss_statistics_master_data_set2.xlsx")
```
## What is logistic regression?
Logistic regression is a statistical method used to model the relationship between a response variable and one or more explanatory variables when the response variable follows a binomial distribution, such as binary variables (1 or 0, success or failure, etc.) or proportion data.
The logistic model describes how the probability of a binary outcome (e.g., presence/absence or success/failure) changes with one or more explanatory variables. Let ($p$) be the probability that the outcome equals 1 (e.g., probability that a species is present at a site, or probability of success). Instead of modeling this probability directly, the logistic model first expresses it as odds (@eq-odds).
$$
odds=\frac{p}{1-p}
$$ {#eq-odds}
The odds describe how likely the event is to occur ($p$) compared to not occurring ($1-p$). For example, if an event has a probability of success ($p$) of 0.75, the odds are 3 (0.75/0.25), meaning the event is three times as likely to occur as not occur.
The logistic regression model then takes the logarithm of the odds (called logit) and models it as a linear function of explanatory variables ($x_1, x_2, ..., x_n$) (@eq-logisticmodel).
$$
logit(p)=log(\frac{p}{1-p})=\beta_0+\beta_1x_1+\beta_2x_2+...+\beta_nx_n
$$ {#eq-logisticmodel}
This is exactly the same as writing:
$$
p=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\beta_2x_2+...+\beta_nx_n)}}
$$
In this model, the coefficients ($\beta$), which are estimated from the data, describe how the explanatory variables affect the log-odds of the outcome.
## Research question
In this tutorial, we will aim to model the relationship between pesticide use (binary variable: Yes or No) and the number of agroecological practices (`n_practices`).
We first need to convert the `pesticide` variable into a binary variable (1 if the farm uses pesticides, 0 if the farm does not use pesticides).
::: callout-tip
## Exercise 1
In `data`, create a new binary variable (`pesticide2`) that is 1 or farms that use pesticides and 0 for farms that do not. Use `dplyr::mutate()` to create this new variable. You will also need to use a relational operator (see first tutorial) to write a condition that will allow R to categorize farms based on pesticide use. The function `if_else()` is also needed here as you can use it to tell R to assign the value 1 when `pesticide > 0`, but 0 otherwise.
```{r}
#| eval: true
#| echo: true
#| message: false
#| warning: false
data <- dplyr::mutate(data, pesticide2 = if_else(pesticide > 0, 1, 0))
```
:::
## Visualizing data
Let's start by visualizing the relationship between pesticide use (`pesticide2`, y-axis) and the number of agroecological practices (`n_practices`).
::: callout-tip
## Exercise 2
Create a scatter plot of the relationship between the number of agroecological practices (`n_practices`, x-axis) and pesticide use (`pesticide2`, y-axis).
```{r}
#| eval: true
#| echo: true
#| warning: false
#| message: false
#| fig-width: 4
#| fig-height: 4
#| fig-align: center
#| fig-cap: "Relationship between the number of agroecological practices and pesticide use."
#| label: fig-logistic1
ggplot(data = data,
mapping = aes(x = n_practices,
y = pesticide2)) +
geom_point(shape=21,
fill = "#FFCD00", #Fill colour for shape
colour = "black")+ #Border colour for shape
xlab("Number of agroecological practices")+ #Change name of x axis
ylab("Pesticide use")+ #Change name of y axis
theme_bw() #This creates a black/white plot
```
:::
## Fitting a logistic regression model
To fit a logistic regression model in R, we can use the `glm()` function (**generalized linear model**). You will need to consider three main arguments to use this function:
- `formula`: a formula specifying the model. The formula must have the following form: `response ~ predictor`.
- `data`: a data frame in which the variables specified in the formula will be found.
- `family`: the error distribution and link function to be used in the model. Default is set to `family = gaussian`, which assumes that residuals follow a normal distribution (model results will then be identical to `lm()`). To fit a logistic regression model, you need to use `family=binomial(link="logit")`. This tells R that the response variable follows a binomial distribution and that a logistic link function should be used.
:::: callout-tip
## Exercise 3
1. Use `glm()` to model the relationship between the number of agroecological practices (`n_practices`, x-axis) and pesticide use(`pesticide2`, y-axis) using a logistic regression. Store the model into an R object called `model`.
2. Run `summary()` on the `model` object and inspect the summary table. Make sure you understand the information contained in this output.
3. What can you conclude about the relationship between the number of agroecological practices and pesticide use? How do you interpret model coefficients?
::: panel-tabset
## Solution 1
```{r}
#| eval: true
#| echo: true
#| message: false
#| warning: false
model <- glm(pesticide2 ~ n_practices,
data,
family = binomial(link="logit"))
```
## Solution 2
```{r}
#| eval: true
#| echo: true
#| message: false
#| warning: false
summary(model)
```
## Solution 3
Both the slope and the intercept of the model are significantly different from zero (p \< 0.01).
The equation of the model is as follows:
$$
logit(p)=log(\frac{p}{1-p})=11.93-0.68 \times x_1
$$
The same equation could be rewritten as:
$$
p=\frac{1}{1+e^{-(11.93-0.68 \times x_1)}}
$$
From the equations above, we can figure out that for the odds of using pesticides get multiplied by $e^{-0.68}=0.51$ every time that one agroecological practice is added.
:::
::::
## Calculating confidence intervals
After fitting a model (for example with `glm()`), the confidence intervals of model coefficients can be obtained by applying `confint()` to the fitted `model` object. By default, this function calculates 95% confidence intervals for each model coefficient.
For logistic regression, the coefficients are expressed in log-odds. If we take the exponential of the confidence intervals using `exp()`, we obtain confidence intervals for the odds ratios.
:::: callout-tip
## Exercise 4
1. Use `confint()` to calculate the lower and upper bounds of the 95% confidence interval for model coefficients.
2. Calculate the confidence intervals for the odds ratio.
3. Is the confidence interval indicating that the effect of number of agroecological practices on pesticide use is positive or negative? Is the effect significant according to the confidence interval?
::: panel-tabset
## Solution 1
```{r}
#| eval: true
#| echo: true
#| message: false
#| warning: false
confint(model)
```
## Solution 2
```{r}
#| eval: true
#| echo: true
#| message: false
#| warning: false
exp(confint(model))
```
## Solution 3
The odds ratio is equal to $e^{-0.68}=0.51$ (see exercise 3). The 95% confidence interval around this odds ratio is \[0.27, 0.73\]. Because the confidence interval of the odds ratio lies entirely below 1, increasing the number of agroecological practices decreases the odds of using pesticides. In other words, the relationship between the number of practices and pesticide use is negative: farms with more agroecological practices are less likely to use pesticides. Since the confidence interval does not include 1, this indicates that the effect is statistically significant.
:::
::::
## Assessing model fit
The goodness of fit of a logistic regression model can be assessed using Nagelkerke’s R^2^, a commonly used pseudo-R^2^ measure for models with a binary response variable. Unlike the R^2^ used in linear regression, logistic regression does not have a direct measure of the proportion of variance explained. Nagelkerke’s R^2^ provides a comparable measure that ranges between 0 and 1, where higher values indicate a better model fit.
In R, Nagelkerke’s R^2^ can be calculated using the `NagelkerkeR2()` function from the *fmsb* package. You just need to specify the `model` object as an input.
Nagelkerke’s R^2^ compares the fitted model with a null model (a model without predictors). A larger value indicates that the explanatory variables improve the model’s ability to predict the response variable.
::: callout-tip
## Exercise 5
Calculate Nagelkerke’s R^2^ for the fitted logistic regression model (`model` object).
```{r}
#| eval: true
#| echo: true
#| message: false
#| warning: false
NagelkerkeR2(model)
```
:::
## Calculating probabilities from the model
:::: callout-tip
## Exercise 6
::: panel-tabset
## Question
What is the probability ($p$) that a farm uses pesticides if it uses a total of 17 agroecological practices? Calculate this probability by hand as well as in R.
## Answer (hand calculations)
The equation of the model is as follows:
$$
logit(p)=log(\frac{p}{1-p})=11.93-0.68 \times x_1
$$
The same equation could be rewritten as:
$$
p=\frac{1}{1+e^{-(11.9311-0.6768 \times x_1)}}
$$
Therefore, the requested probability ($p$) can be calculated as:
$$
p=\frac{1}{1+e^{-(11.9311-0.6768 \times 17)}}=0.60
$$
If a farm uses 17 agroecological practices, there is a 60% chance that it uses pesticides.
## Answer (R calculations)
```{r}
#| eval: true
#| echo: true
#| message: false
#| warning: false
p <- 1/(1+exp(-(11.9311-0.6768*17)))
```
If a farm uses 17 agroecological practices, there is a `{r} round(p, 2)*100`% chance that it uses pesticides.
:::
::::
## Visualizing model output
An easy way to visualize the results of a logistic regression model is to use `geom_smooth()` in *ggplot2*. This function can automatically fit a model and display the predicted relationship between the response variable and a predictor.
For a binary response variable, you can specify a logistic regression model by using `method = "glm"` and `method.args = list(family = "binomial")`.
::: callout-tip
## Exercise 7
Improve the scatter plot from exercise 2 by adding a line showing the fitted logistic model. In `geom_smooth()`, set `se=TRUE` to display the 95% confidence interval. Now that you have more experience with *ggplot2*, try to write the code yourself and do not look at the answer right away.
```{r}
#| eval: true
#| echo: true
#| warning: false
#| message: false
#| fig-width: 4
#| fig-height: 4
#| fig-align: center
#| fig-cap: "Relationship between the number of agroecological practices and pesticide use."
#| label: fig-logistic2
ggplot(data = data,
mapping = aes(x = n_practices,
y = pesticide2)) +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = TRUE,
colour = "black")+
geom_point(shape=21,
fill = "#FFCD00", #Fill colour for shape
colour = "black")+ #Border colour for shape
xlab("Number of agroecological practices")+ #Change name of x axis
ylab("Pesticide use")+ #Change name of y axis
theme_bw() #This creates a black/white plot
```
:::