Tutorial 11: Logistic regression

Authors

Affiliations

Benjamin Delory

Copernicus institute of sustainable development, Utrecht University

Natalie Davis

Amsterdam sustainability institute, Vrije Universiteit Amsterdam

Heitor Mancini Teixeira

Departamento de Solos Centro de Ciências Agrárias, Universidade Federal de Vicosa, Brazil

About this tutorial

In this tutorial, you will learn how to fit a logistic regression model in R, interpret the model coefficients, and calculate predicted probabilities. We will also explore how to evaluate the performance of the model and assess how well it explains the observed data.

Let’s get started!

Install R packages

In this tutorial, we will need to use functions in the fmsb package. Make sure to install this package before starting the tutorial.

Show me the code

install.packages("fmsb")

Load R packages

To work on this tutorial, you will need to load the readxl, tidyverse, and fmsb packages.

Show me the code

library(readxl) 
library(tidyverse)
library(fmsb)

Importing data

Start by importing the data used in this course (see previous tutorials).

Show me the code

#Option 1 (csv file) 
data <- read_csv("gss_statistics_master_data_set2.csv")  

#Option 2 (Excel file) 
data <- read_excel("gss_statistics_master_data_set2.xlsx")

What is logistic regression?

Logistic regression is a statistical method used to model the relationship between a response variable and one or more explanatory variables when the response variable follows a binomial distribution, such as binary variables (1 or 0, success or failure, etc.) or proportion data.

The logistic model describes how the probability of a binary outcome (e.g., presence/absence or success/failure) changes with one or more explanatory variables. Let ($p$) be the probability that the outcome equals 1 (e.g., probability that a species is present at a site, or probability of success). Instead of modeling this probability directly, the logistic model first expresses it as odds (Equation 1).

\[ odds=\frac{p}{1-p} \tag{1}\]

The odds describe how likely the event is to occur ($p$) compared to not occurring ($1-p$). For example, if an event has a probability of success ($p$) of 0.75, the odds are 3 (0.75/0.25), meaning the event is three times as likely to occur as not occur.

The logistic regression model then takes the logarithm of the odds (called logit) and models it as a linear function of explanatory variables ($x_1, x_2, ..., x_n$) (Equation 2).

\[ logit(p)=log(\frac{p}{1-p})=\beta_0+\beta_1x_1+\beta_2x_2+...+\beta_nx_n \tag{2}\]

This is exactly the same as writing:

\[ p=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\beta_2x_2+...+\beta_nx_n)}} \]

In this model, the coefficients ($\beta$), which are estimated from the data, describe how the explanatory variables affect the log-odds of the outcome.

Research question

In this tutorial, we will aim to model the relationship between pesticide use (binary variable: Yes or No) and the number of agroecological practices (n_practices).

We first need to convert the pesticide variable into a binary variable (1 if the farm uses pesticides, 0 if the farm does not use pesticides).

Exercise 1

In data, create a new binary variable (pesticide2) that is 1 or farms that use pesticides and 0 for farms that do not. Use dplyr::mutate() to create this new variable. You will also need to use a relational operator (see first tutorial) to write a condition that will allow R to categorize farms based on pesticide use. The function if_else() is also needed here as you can use it to tell R to assign the value 1 when pesticide > 0, but 0 otherwise.

Show me the code

data <- dplyr::mutate(data, pesticide2 = if_else(pesticide > 0, 1, 0))

Visualizing data

Let’s start by visualizing the relationship between pesticide use (pesticide2, y-axis) and the number of agroecological practices (n_practices).

Exercise 2

Create a scatter plot of the relationship between the number of agroecological practices (n_practices, x-axis) and pesticide use (pesticide2, y-axis).

Show me the code

ggplot(data = data, 
       mapping = aes(x = n_practices,
                     y = pesticide2)) +
  geom_point(shape=21,
             fill = "#FFCD00", #Fill colour for shape
             colour = "black")+ #Border colour for shape
  xlab("Number of agroecological practices")+ #Change name of x axis
  ylab("Pesticide use")+ #Change name of y axis
  theme_bw() #This creates a black/white plot

Figure 1: Relationship between the number of agroecological practices and pesticide use.

Fitting a logistic regression model

To fit a logistic regression model in R, we can use the glm() function (generalized linear model). You will need to consider three main arguments to use this function:

formula: a formula specifying the model. The formula must have the following form: response ~ predictor.
data: a data frame in which the variables specified in the formula will be found.
family: the error distribution and link function to be used in the model. Default is set to family = gaussian, which assumes that residuals follow a normal distribution (model results will then be identical to lm()). To fit a logistic regression model, you need to use family=binomial(link="logit"). This tells R that the response variable follows a binomial distribution and that a logistic link function should be used.

Exercise 3

Use glm() to model the relationship between the number of agroecological practices (n_practices, x-axis) and pesticide use(pesticide2, y-axis) using a logistic regression. Store the model into an R object called model.
Run summary() on the model object and inspect the summary table. Make sure you understand the information contained in this output.
What can you conclude about the relationship between the number of agroecological practices and pesticide use? How do you interpret model coefficients?

Show me the code

model <- glm(pesticide2 ~ n_practices,
             data,
             family = binomial(link="logit"))

Show me the code

summary(model)


Call:
glm(formula = pesticide2 ~ n_practices, family = binomial(link = "logit"), 
    data = data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  11.9311     4.2124   2.832  0.00462 **
n_practices  -0.6768     0.2399  -2.822  0.00478 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 47.092  on 35  degrees of freedom
Residual deviance: 20.274  on 34  degrees of freedom
AIC: 24.274

Number of Fisher Scoring iterations: 7

Both the slope and the intercept of the model are significantly different from zero (p < 0.01).

The equation of the model is as follows:

\[ logit(p)=log(\frac{p}{1-p})=11.93-0.68 \times x_1 \]

The same equation could be rewritten as:

\[ p=\frac{1}{1+e^{-(11.93-0.68 \times x_1)}} \]

From the equations above, we can figure out that for the odds of using pesticides get multiplied by $e^{-0.68}=0.51$ every time that one agroecological practice is added.

Calculating confidence intervals

After fitting a model (for example with glm()), the confidence intervals of model coefficients can be obtained by applying confint() to the fitted model object. By default, this function calculates 95% confidence intervals for each model coefficient.

For logistic regression, the coefficients are expressed in log-odds. If we take the exponential of the confidence intervals using exp(), we obtain confidence intervals for the odds ratios.

Exercise 4

Use confint() to calculate the lower and upper bounds of the 95% confidence interval for model coefficients.
Calculate the confidence intervals for the odds ratio.
Is the confidence interval indicating that the effect of number of agroecological practices on pesticide use is positive or negative? Is the effect significant according to the confidence interval?

Show me the code

confint(model)

                2.5 %    97.5 %
(Intercept)  5.687697 23.068457
n_practices -1.307524 -0.318642

Show me the code

exp(confint(model))

                  2.5 %       97.5 %
(Intercept) 295.2128188 1.043526e+10
n_practices   0.2704889 7.271358e-01

The odds ratio is equal to $e^{-0.68}=0.51$ (see exercise 3). The 95% confidence interval around this odds ratio is [0.27, 0.73]. Because the confidence interval of the odds ratio lies entirely below 1, increasing the number of agroecological practices decreases the odds of using pesticides. In other words, the relationship between the number of practices and pesticide use is negative: farms with more agroecological practices are less likely to use pesticides. Since the confidence interval does not include 1, this indicates that the effect is statistically significant.

Assessing model fit

The goodness of fit of a logistic regression model can be assessed using Nagelkerke’s R², a commonly used pseudo-R² measure for models with a binary response variable. Unlike the R² used in linear regression, logistic regression does not have a direct measure of the proportion of variance explained. Nagelkerke’s R² provides a comparable measure that ranges between 0 and 1, where higher values indicate a better model fit.

In R, Nagelkerke’s R² can be calculated using the NagelkerkeR2() function from the fmsb package. You just need to specify the model object as an input.

Nagelkerke’s R² compares the fitted model with a null model (a model without predictors). A larger value indicates that the explanatory variables improve the model’s ability to predict the response variable.

Exercise 5

Calculate Nagelkerke’s R² for the fitted logistic regression model (model object).

Show me the code

NagelkerkeR2(model)

$N
[1] 36

$R2
[1] 0.719826

Calculating probabilities from the model

Exercise 6

What is the probability ($p$) that a farm uses pesticides if it uses a total of 17 agroecological practices? Calculate this probability by hand as well as in R.

The equation of the model is as follows:

\[ logit(p)=log(\frac{p}{1-p})=11.93-0.68 \times x_1 \]

The same equation could be rewritten as:

\[ p=\frac{1}{1+e^{-(11.9311-0.6768 \times x_1)}} \]

Therefore, the requested probability ($p$) can be calculated as:

\[ p=\frac{1}{1+e^{-(11.9311-0.6768 \times 17)}}=0.60 \]

If a farm uses 17 agroecological practices, there is a 60% chance that it uses pesticides.

Show me the code

p <- 1/(1+exp(-(11.9311-0.6768*17)))

If a farm uses 17 agroecological practices, there is a 60% chance that it uses pesticides.

Visualizing model output

An easy way to visualize the results of a logistic regression model is to use geom_smooth() in ggplot2. This function can automatically fit a model and display the predicted relationship between the response variable and a predictor.

For a binary response variable, you can specify a logistic regression model by using method = "glm" and method.args = list(family = "binomial").

Exercise 7

Improve the scatter plot from exercise 2 by adding a line showing the fitted logistic model. In geom_smooth(), set se=TRUE to display the 95% confidence interval. Now that you have more experience with ggplot2, try to write the code yourself and do not look at the answer right away.

Show me the code

ggplot(data = data, 
       mapping = aes(x = n_practices,
                     y = pesticide2)) +
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"),
              se = TRUE,
              colour = "black")+
  geom_point(shape=21,
             fill = "#FFCD00", #Fill colour for shape
             colour = "black")+ #Border colour for shape
  xlab("Number of agroecological practices")+ #Change name of x axis
  ylab("Pesticide use")+ #Change name of y axis
  theme_bw() #This creates a black/white plot

Figure 2: Relationship between the number of agroecological practices and pesticide use.

--- title: "Tutorial 11: Logistic regression" author: - name: Benjamin Delory orcid: 0000-0002-1190-8060 email: b.m.m.delory@uu.nl affiliations: - name: Copernicus institute of sustainable development, Utrecht University - name: Natalie Davis orcid: 0000-0002-2678-0389 email: n.a.davis@vu.nl affiliations: - name: Amsterdam sustainability institute, Vrije Universiteit Amsterdam - name: Heitor Mancini Teixeira orcid: 0000-0001-6992-0671 email: heitor.teixeira@ufv.br affiliations: - name: Departamento de Solos Centro de Ciências Agrárias, Universidade Federal de Vicosa, Brazil format: html editor: visual editor_options: chunk_output_type: console image: /Images/Rlogo.png --- ## About this tutorial In this tutorial, you will learn how to fit a logistic regression model in R, interpret the model coefficients, and calculate predicted probabilities. We will also explore how to evaluate the performance of the model and assess how well it explains the observed data. Let's get started! ## Install R packages In this tutorial, we will need to use functions in the *fmsb* package. Make sure to install this package before starting the tutorial. ```{r} #| eval: false #| echo: true #| message: false #| warning: false install.packages("fmsb") ``` ## Load R packages To work on this tutorial, you will need to load the *readxl*, *tidyverse*, and *fmsb* packages. ```{r} #| eval: true #| echo: true #| message: false #| warning: false library(readxl) library(tidyverse) library(fmsb) ``` ## Importing data Start by importing the data used in this course (see previous tutorials). ```{r} #| eval: true #| echo: true #| message: false #| warning: false #Option 1 (csv file) data <- read_csv("gss_statistics_master_data_set2.csv") #Option 2 (Excel file) data <- read_excel("gss_statistics_master_data_set2.xlsx") ``` ## What is logistic regression? Logistic regression is a statistical method used to model the relationship between a response variable and one or more explanatory variables when the response variable follows a binomial distribution, such as binary variables (1 or 0, success or failure, etc.) or proportion data. The logistic model describes how the probability of a binary outcome (e.g., presence/absence or success/failure) changes with one or more explanatory variables. Let ($p$) be the probability that the outcome equals 1 (e.g., probability that a species is present at a site, or probability of success). Instead of modeling this probability directly, the logistic model first expresses it as odds (@eq-odds). $$ odds=\frac{p}{1-p} $$ {#eq-odds} The odds describe how likely the event is to occur ($p$) compared to not occurring ($1-p$). For example, if an event has a probability of success ($p$) of 0.75, the odds are 3 (0.75/0.25), meaning the event is three times as likely to occur as not occur. The logistic regression model then takes the logarithm of the odds (called logit) and models it as a linear function of explanatory variables ($x_1, x_2, ..., x_n$) (@eq-logisticmodel). $$ logit(p)=log(\frac{p}{1-p})=\beta_0+\beta_1x_1+\beta_2x_2+...+\beta_nx_n $$ {#eq-logisticmodel} This is exactly the same as writing: $$ p=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\beta_2x_2+...+\beta_nx_n)}} $$ In this model, the coefficients ($\beta$), which are estimated from the data, describe how the explanatory variables affect the log-odds of the outcome. ## Research question In this tutorial, we will aim to model the relationship between pesticide use (binary variable: Yes or No) and the number of agroecological practices (`n_practices`). We first need to convert the `pesticide` variable into a binary variable (1 if the farm uses pesticides, 0 if the farm does not use pesticides). ::: callout-tip ## Exercise 1 In `data`, create a new binary variable (`pesticide2`) that is 1 or farms that use pesticides and 0 for farms that do not. Use `dplyr::mutate()` to create this new variable. You will also need to use a relational operator (see first tutorial) to write a condition that will allow R to categorize farms based on pesticide use. The function `if_else()` is also needed here as you can use it to tell R to assign the value 1 when `pesticide > 0`, but 0 otherwise. ```{r} #| eval: true #| echo: true #| message: false #| warning: false data <- dplyr::mutate(data, pesticide2 = if_else(pesticide > 0, 1, 0)) ``` ::: ## Visualizing data Let's start by visualizing the relationship between pesticide use (`pesticide2`, y-axis) and the number of agroecological practices (`n_practices`). ::: callout-tip ## Exercise 2 Create a scatter plot of the relationship between the number of agroecological practices (`n_practices`, x-axis) and pesticide use (`pesticide2`, y-axis). ```{r} #| eval: true #| echo: true #| warning: false #| message: false #| fig-width: 4 #| fig-height: 4 #| fig-align: center #| fig-cap: "Relationship between the number of agroecological practices and pesticide use." #| label: fig-logistic1 ggplot(data = data, mapping = aes(x = n_practices, y = pesticide2)) + geom_point(shape=21, fill = "#FFCD00", #Fill colour for shape colour = "black")+ #Border colour for shape xlab("Number of agroecological practices")+ #Change name of x axis ylab("Pesticide use")+ #Change name of y axis theme_bw() #This creates a black/white plot ``` ::: ## Fitting a logistic regression model To fit a logistic regression model in R, we can use the `glm()` function (**generalized linear model**). You will need to consider three main arguments to use this function: - `formula`: a formula specifying the model. The formula must have the following form: `response ~ predictor`. - `data`: a data frame in which the variables specified in the formula will be found. - `family`: the error distribution and link function to be used in the model. Default is set to `family = gaussian`, which assumes that residuals follow a normal distribution (model results will then be identical to `lm()`). To fit a logistic regression model, you need to use `family=binomial(link="logit")`. This tells R that the response variable follows a binomial distribution and that a logistic link function should be used. :::: callout-tip ## Exercise 3 1. Use `glm()` to model the relationship between the number of agroecological practices (`n_practices`, x-axis) and pesticide use(`pesticide2`, y-axis) using a logistic regression. Store the model into an R object called `model`. 2. Run `summary()` on the `model` object and inspect the summary table. Make sure you understand the information contained in this output. 3. What can you conclude about the relationship between the number of agroecological practices and pesticide use? How do you interpret model coefficients? ::: panel-tabset ## Solution 1 ```{r} #| eval: true #| echo: true #| message: false #| warning: false model <- glm(pesticide2 ~ n_practices, data, family = binomial(link="logit")) ``` ## Solution 2 ```{r} #| eval: true #| echo: true #| message: false #| warning: false summary(model) ``` ## Solution 3 Both the slope and the intercept of the model are significantly different from zero (p \< 0.01). The equation of the model is as follows: $$ logit(p)=log(\frac{p}{1-p})=11.93-0.68 \times x_1 $$ The same equation could be rewritten as: $$ p=\frac{1}{1+e^{-(11.93-0.68 \times x_1)}} $$ From the equations above, we can figure out that for the odds of using pesticides get multiplied by $e^{-0.68}=0.51$ every time that one agroecological practice is added. ::: :::: ## Calculating confidence intervals After fitting a model (for example with `glm()`), the confidence intervals of model coefficients can be obtained by applying `confint()` to the fitted `model` object. By default, this function calculates 95% confidence intervals for each model coefficient. For logistic regression, the coefficients are expressed in log-odds. If we take the exponential of the confidence intervals using `exp()`, we obtain confidence intervals for the odds ratios. :::: callout-tip ## Exercise 4 1. Use `confint()` to calculate the lower and upper bounds of the 95% confidence interval for model coefficients. 2. Calculate the confidence intervals for the odds ratio. 3. Is the confidence interval indicating that the effect of number of agroecological practices on pesticide use is positive or negative? Is the effect significant according to the confidence interval? ::: panel-tabset ## Solution 1 ```{r} #| eval: true #| echo: true #| message: false #| warning: false confint(model) ``` ## Solution 2 ```{r} #| eval: true #| echo: true #| message: false #| warning: false exp(confint(model)) ``` ## Solution 3 The odds ratio is equal to $e^{-0.68}=0.51$ (see exercise 3). The 95% confidence interval around this odds ratio is \[0.27, 0.73\]. Because the confidence interval of the odds ratio lies entirely below 1, increasing the number of agroecological practices decreases the odds of using pesticides. In other words, the relationship between the number of practices and pesticide use is negative: farms with more agroecological practices are less likely to use pesticides. Since the confidence interval does not include 1, this indicates that the effect is statistically significant. ::: :::: ## Assessing model fit The goodness of fit of a logistic regression model can be assessed using Nagelkerke’s R^2^, a commonly used pseudo-R^2^ measure for models with a binary response variable. Unlike the R^2^ used in linear regression, logistic regression does not have a direct measure of the proportion of variance explained. Nagelkerke’s R^2^ provides a comparable measure that ranges between 0 and 1, where higher values indicate a better model fit. In R, Nagelkerke’s R^2^ can be calculated using the `NagelkerkeR2()` function from the *fmsb* package. You just need to specify the `model` object as an input. Nagelkerke’s R^2^ compares the fitted model with a null model (a model without predictors). A larger value indicates that the explanatory variables improve the model’s ability to predict the response variable. ::: callout-tip ## Exercise 5 Calculate Nagelkerke’s R^2^ for the fitted logistic regression model (`model` object). ```{r} #| eval: true #| echo: true #| message: false #| warning: false NagelkerkeR2(model) ``` ::: ## Calculating probabilities from the model :::: callout-tip ## Exercise 6 ::: panel-tabset ## Question What is the probability ($p$) that a farm uses pesticides if it uses a total of 17 agroecological practices? Calculate this probability by hand as well as in R. ## Answer (hand calculations) The equation of the model is as follows: $$ logit(p)=log(\frac{p}{1-p})=11.93-0.68 \times x_1 $$ The same equation could be rewritten as: $$ p=\frac{1}{1+e^{-(11.9311-0.6768 \times x_1)}} $$ Therefore, the requested probability ($p$) can be calculated as: $$ p=\frac{1}{1+e^{-(11.9311-0.6768 \times 17)}}=0.60 $$ If a farm uses 17 agroecological practices, there is a 60% chance that it uses pesticides. ## Answer (R calculations) ```{r} #| eval: true #| echo: true #| message: false #| warning: false p <- 1/(1+exp(-(11.9311-0.6768*17))) ``` If a farm uses 17 agroecological practices, there is a `{r} round(p, 2)*100`% chance that it uses pesticides. ::: :::: ## Visualizing model output An easy way to visualize the results of a logistic regression model is to use `geom_smooth()` in *ggplot2*. This function can automatically fit a model and display the predicted relationship between the response variable and a predictor. For a binary response variable, you can specify a logistic regression model by using `method = "glm"` and `method.args = list(family = "binomial")`. ::: callout-tip ## Exercise 7 Improve the scatter plot from exercise 2 by adding a line showing the fitted logistic model. In `geom_smooth()`, set `se=TRUE` to display the 95% confidence interval. Now that you have more experience with *ggplot2*, try to write the code yourself and do not look at the answer right away. ```{r} #| eval: true #| echo: true #| warning: false #| message: false #| fig-width: 4 #| fig-height: 4 #| fig-align: center #| fig-cap: "Relationship between the number of agroecological practices and pesticide use." #| label: fig-logistic2 ggplot(data = data, mapping = aes(x = n_practices, y = pesticide2)) + geom_smooth(method = "glm", method.args = list(family = "binomial"), se = TRUE, colour = "black")+ geom_point(shape=21, fill = "#FFCD00", #Fill colour for shape colour = "black")+ #Border colour for shape xlab("Number of agroecological practices")+ #Change name of x axis ylab("Pesticide use")+ #Change name of y axis theme_bw() #This creates a black/white plot ``` :::