19  Dangerous Driving

Drink-driving is one of the most prevalent causes of road accidents in New Zealand. New Zealand teenagers and young adults are over-represented in motor vehicle traffic injury and death statistics. It is important to investigate what influences the development of dangerous driving behaviours that may cause these accidents. This lesson investigates the different factors that may influence drink-driving (Driving While Impaired (DWI)) in New Zealand youths. The research was conducted by Pauline Gulliver (Injury Prevention Research Unit, University of Otago).

Data

Data Summary

1837 observations

11 variables

Variable Type Information
sex Categorical 2 levels: 1 (female), 2 (male).
aggress15 Discrete Counts of aggressive behaviour at age 15.
aggress18 Discrete Counts of aggressive behaviour at age 18.
drunkad Binary 1 (ever passenger in car with drunk adult driver at age 15), 0 (never passenger in car with drunk adult driver).
drunkad18 Binary 1 (ever passenger in car with drunk adult driver at age 18), 0 (never passenger in car with drunk adult driver).
drunktee Binary 1 (ever passenger in car with drunk teen driver at age 15), 0 (never passenger in car with drunk teen driver).
drtee18 Binary 1 (ever passenger in car with drunk teen driver at age 18), 0 (never passenger in car with drunk teen driver).
crash15 Binary 1 (involved in traffic accident in past 2 years at age 15), 0 (not involved in traffic accident in past 2 years).
crash18 Binary 1 (involved in traffic accident in past 3 years at age 18), 0 (not involved in traffic accident in past 3 years).
drnk_dr21 Binary 1 (drove after drinking too much between ages of 18 and 21), 0 (never drove after drinking too much between ages of 18 and 21).
per_safe21 Continuous Difference between perceived safe amount to drink before driving and estimated legal consumption limit.

There are 2 files associated with this presentation. The first contains the data you will need to complete the lesson tasks, and the second contains descriptions of the variables included in the data file.

Important Information

This lesson uses analysis techniques for both count (proportion) and continuous data. Most of these have been explored in previous lessons so the code is initially hidden. It is recommended you leave the tasks below until after you have worked through the other lessons. These tasks can then function as revision exercises, with solutions available by revealing the code.

Objectives

Learning Objectives

This lesson provides the opportunity to recall and practise analysis techniques that have been previously demonstrated.

Reinforcing skills and concepts seen in earlier lessons:

  1. Data wrangling - read data, subsetting.

  2. Confidence intervals, hypothesis tests - chi-squared test, difference in proportions, difference in means.

  3. Logistic regression, ANOVA.

Tasks

0. Read data

0a. Read in the data

First make sure you have installed the package readxl and set the working directory.

Load the data into R.

Important Information

Name your data accidents for easier reference later.

The code has been hidden initially, so you can try to load the data yourself first before checking the solutions.

Code
#loads readxl package
library(readxl) 
Code
#loads the data file and names it accidents
accidents<-read_xls("YouthAccidentData.xls") 

#view beginning of data frame
head(accidents)
# A tibble: 6 × 11
    sex per_sa…¹ crash18 crash15 drunkad drunk…² drunk…³ drtee18 aggre…⁴ aggre…⁵
  <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1     1     -2.5       1       0       1      NA       0       0       5       5
2     1     -2.5       0       0       0      NA       1       0       5       5
3     1     -3         0       1       0       0       1       1       5       5
4     1     -3         1      NA      NA       0      NA       1       5       5
5     1     -3         0       0       0      NA       1       0       5       5
6     1      0         0       0       1      NA       0       0       5       5
# … with 1 more variable: drnk_dr21 <dbl>, and abbreviated variable names
#   ¹​per_safe21, ²​drunkad18, ³​drunktee, ⁴​aggress15, ⁵​aggress18
# ℹ Use `colnames()` to see all variable names

This opens a data set from a study investigating the effect of alcohol consumption on youth motor accidents, focusing on drivers aged 15 to 21 years.

The variables recorded are sex (1=Female, 2=Male), per_safe21 (difference between the number of standard drinks the participant perceived to be ‘safe’ to consume before driving and the number of standard drinks the participant could legally consume before driving*), crash18 (if respondents at age 18 had been in any traffic accidents in the last 3 years: 0=No and 1=Yes), crash15 (if respondents at age 15 had been in any traffic accidents in the last 2 years: 0=No and 1=Yes), drunkad (at age 15, if the respondent had been a passenger in a car with an adult driver who was considered to be over the legal limit*: 0=No and 1=Yes), drunkad18 (the same as drunkad but at age 18), drunktee (at age 15, if the respondent had been a passenger in a car with a youth/teenage driver who was considered to be over the legal limit*: 0=No and 1=Yes), drtee18 (the same as drunktee but at age 18), aggress15 (frequency of aggressive behaviour at age 15), aggress18 (frequency of aggressive behaviour at age 18), and drnk_dr21 (if the respondent had driven after drinking too much between the ages of 18 and 21: 0=No and 1=Yes).

*The legal limit was taken to be 5 or more glasses of beer or wine. This limit has been lowered since the study was conducted.

Note that constants have been added to some variables to ensure anonymity for respondents. Additionally, due to confidentiality issues, the original data cannot be used. The dataset for this lesson is simulated data that produces the same results as the original.
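The head() output above already shows several NA entries, so it can help to count missing values before analysing anything. The sketch below uses a small made-up data frame called toy (a hypothetical stand-in, not the lesson data) so that it runs without the data file. The same two calls could be applied to accidents.

```r
# Hypothetical toy data standing in for `accidents` (all values are made up)
toy <- data.frame(
  sex       = c(1, 1, 2, 2, 2, 1),
  crash18   = c(1, 0, NA, 1, 0, 0),
  drunkad18 = c(NA, NA, 0, 0, NA, 1),
  drnk_dr21 = c(0, 0, 1, NA, 0, 1)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(toy))
n_complete
```

Counting complete rows in this way gives an early indication of how many observations a model using every column would be able to keep.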

1. Subsetting Data Frame, \(\chi^2\) Tests

Carry out chi-square tests for males to investigate the relationship between driving while impaired (DWI, variable name drnk_dr21) behaviour and various factors.

1a. Subsetting

Select the male respondents in the data set. We will be keeping this restriction for the tasks that follow, so create a new data frame containing only males (sex=2). This is also a good opportunity to remove the rows with NA values in the crash18 or drnk_dr21 variables.

Code
#subsetting the rows of the original data frame where sex=2 (males), removing rows with NAs
accidentsM<-accidents[accidents$sex==2&!is.na(accidents$crash18)&!is.na(accidents$drnk_dr21),]  
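As an alternative sketch, base R's subset() expresses the same filter with less repetition. It also behaves differently if the filtering variable itself has missing values: bracket indexing on a data frame returns a row of NAs wherever the condition is NA, while subset() drops those rows. The data frame toy below is made up for illustration, and whether sex has any missing values in the lesson data is not stated.

```r
# Hypothetical toy data (made up); the last row has a missing value for sex
toy <- data.frame(
  sex       = c(1, 2, 2, 2, 2, NA),
  crash18   = c(0, 1, NA, 0, 1, 1),
  drnk_dr21 = c(0, 0, 1, NA, 1, 0)
)

# Bracket indexing, as in the lesson code: an NA condition yields an all-NA row
by_bracket <- toy[toy$sex == 2 & !is.na(toy$crash18) & !is.na(toy$drnk_dr21), ]

# subset() keeps only rows where the condition is TRUE
by_subset <- subset(toy, sex == 2 & !is.na(crash18) & !is.na(drnk_dr21))

nrow(by_bracket)
nrow(by_subset)
```

If the filtering column has no missing values, the two approaches return the same rows.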

1b. \(\chi^2\) Tests

Construct a table of counts and carry out the chi-square test for DWI (driving while impaired, drnk_dr21) with having crashed in the last 3 years at age 18 (crash18) using the male data.

Report your conclusion from the chi-square value and its associated p-value.

Code
table(accidentsM$crash18,accidentsM$drnk_dr21,dnn=c("Crash18","DWI"))
       DWI
Crash18   0   1
      0 391  98
      1 110  13
Code
chisq.test(accidentsM$crash18,accidentsM$drnk_dr21)

    Pearson's Chi-squared test with Yates' continuity correction

data:  accidentsM$crash18 and accidentsM$drnk_dr21
X-squared = 5.3176, df = 1, p-value = 0.02111

The chi-squared p-value is 0.0211, which is below 0.05, so there is evidence of a significant association between driving while impaired and experiencing a traffic accident between the ages of 15 and 18.
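To see where the test statistic comes from, the calculation can be reproduced from the four counts in the table above. This is a sketch rather than part of the original solution; the object names (observed, expected, x2_yates) are chosen here for illustration.

```r
# Observed counts copied from the table above (rows: crash18, columns: DWI)
observed <- matrix(c(391, 110, 98, 13), nrow = 2,
                   dimnames = list(Crash18 = c("0", "1"), DWI = c("0", "1")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates' continuity correction subtracts 0.5 from each |observed - expected|
x2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_value  <- pchisq(x2_yates, df = 1, lower.tail = FALSE)

# chisq.test() also accepts the matrix of counts directly
test_from_counts <- chisq.test(observed)

round(c(statistic = x2_yates, p.value = p_value), 4)
```

Passing a matrix of counts to chisq.test() is convenient when only a published table, and not the raw data, is available.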

Repeat the chi-squared test for DWI (drnk_dr21) with travelled with an impaired adult at age 15 (drunkad), and for DWI (drnk_dr21) with travelled with an impaired youth at age 18 (drtee18).

Report your conclusions and discuss any results of interest.

2. Confidence Interval, Hypothesis Test (difference in proportions)

Carry out a test for difference in proportions of males involved in a crash from ages 15-18 between those who had driven while impaired from ages 18-21 and those who had not.

Establish null and alternative hypotheses and interpret your result in the context of these. Do your conclusions from the confidence interval for the difference between the two proportions also line up with this?

Code
#first argument is the number of successes (in this case having a crash at age 18)
#second argument is number of trials
prop.test(c(length(which(accidentsM$crash18=="1"&accidentsM$drnk_dr21=="0")),
      length(which(accidentsM$crash18=="1"&accidentsM$drnk_dr21=="1"))),
      n=c(length(which(accidentsM$drnk_dr21=="0")),length(which(accidentsM$drnk_dr21=="1"))))

    2-sample test for equality of proportions with continuity correction

data:  c(length(which(accidentsM$crash18 == "1" & accidentsM$drnk_dr21 == "0")), length(which(accidentsM$crash18 == "1" & accidentsM$drnk_dr21 == "1"))) out of c(length(which(accidentsM$drnk_dr21 == "0")), length(which(accidentsM$drnk_dr21 == "1")))
X-squared = 5.3176, df = 1, p-value = 0.02111
alternative hypothesis: two.sided
95 percent confidence interval:
 0.02699603 0.17789149
sample estimates:
   prop 1    prop 2 
0.2195609 0.1171171 

We have the null hypothesis that there is no difference in the proportion of males involved in a crash from ages 15-18 between those who had driven while impaired from ages 18-21 and those who had not, and the alternative hypothesis that this difference is not equal to 0.

The p-value is 0.0211, which provides significant evidence to reject the null hypothesis in favour of the alternative. We conclude that the proportion of males who experienced a crash between ages 15 and 18 differs between those who later drove while impaired and those who did not.

The true proportion of males who had not driven while impaired that experienced a crash between ages 15 and 18 is estimated, with 95% confidence, to be between 0.0270 and 0.1779 higher than the true proportion of males who had driven while impaired that experienced a crash. This interval does not include 0, so it matches the hypothesis test conclusion of a significant difference.
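The same test can be written more compactly by passing the counts directly, and the confidence interval can be checked by hand. In this sketch the counts come from the table in task 1b (110 of 501 non-DWI males and 13 of 111 DWI males had a crash); the object names are illustrative.

```r
# Crash counts and group sizes taken from the table in task 1b
crashes <- c(no_dwi = 110, dwi = 13)
group_n <- c(no_dwi = 501, dwi = 111)

# Same test as above, with the counts supplied directly
res <- prop.test(x = crashes, n = group_n)

# 95% interval by hand: estimate +/- (z * standard error + continuity correction)
p_hat    <- crashes / group_n
diff_hat <- p_hat[["no_dwi"]] - p_hat[["dwi"]]
se       <- sqrt(sum(p_hat * (1 - p_hat) / group_n))
cc       <- 0.5 * sum(1 / group_n)
ci_by_hand <- diff_hat + c(-1, 1) * (qnorm(0.975) * se + cc)

round(ci_by_hand, 4)
```

The hand calculation uses the unpooled standard error, which is what prop.test() uses for its interval; the test statistic itself is based on the pooled proportion.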

3. Confidence Interval, Hypothesis Test (difference in means)

Perform t-tests on the aggression variable for age 15 (aggress15) using the DWI (drnk_dr21) categories as the groups.

Report your findings from the t-test, with reference to the p-value and confidence interval.

Code
#first test if variances are equal
var.test(aggress15 ~ drnk_dr21, data=accidentsM, alternative = "two.sided") 

    F test to compare two variances

data:  aggress15 by drnk_dr21
F = 0.24213, num df = 485, denom df = 110, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.1780009 0.3204206
sample estimates:
ratio of variances 
         0.2421293 
Code
#significant evidence against null hypothesis that variances are equal, use var.equal=F in t test
t.test(accidentsM$aggress15[accidentsM$drnk_dr21=="0"],
accidentsM$aggress15[accidentsM$drnk_dr21=="1"],var.equal = F)

    Welch Two Sample t-test

data:  accidentsM$aggress15[accidentsM$drnk_dr21 == "0"] and accidentsM$aggress15[accidentsM$drnk_dr21 == "1"]
t = -1.5654, df = 122.42, p-value = 0.1201
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.61442934  0.07177557
sample estimates:
mean of x mean of y 
 5.368313  5.639640 

The null hypothesis is that there is no difference in the mean frequency of aggressive behaviour at age 15 between those who had driven while impaired from ages 18-21 and those who had not, and the alternative hypothesis that this difference is not equal to 0.

The p-value of the t-test is 0.1201, which provides no significant evidence to reject the null hypothesis in favour of the alternative.

The true mean frequency of aggressive behaviour at age 15 in males who had not driven while impaired is estimated with 95% confidence to be between 0.6144 lower and 0.0718 higher than the true mean frequency of aggressive behaviour of males who had driven while impaired. This confidence interval includes 0 so matches the hypothesis test conclusion of no significant difference.
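The t-test can also be written with the formula interface, which avoids subsetting each group by hand. The sketch below checks that the two styles agree. It uses simulated values in a made-up data frame toy (not the lesson data), so the numbers it prints will not match the output above.

```r
# Hypothetical simulated data (made up); the DWI group is given a larger spread
set.seed(1)
toy <- data.frame(
  drnk_dr21 = rep(c(0, 1), times = c(40, 15)),
  aggress15 = c(rnorm(40, mean = 5.4, sd = 0.5), rnorm(15, mean = 5.6, sd = 1))
)

# Two-vector form, as used in the lesson
welch_vectors <- t.test(toy$aggress15[toy$drnk_dr21 == 0],
                        toy$aggress15[toy$drnk_dr21 == 1],
                        var.equal = FALSE)

# Formula form: response ~ grouping variable
welch_formula <- t.test(aggress15 ~ drnk_dr21, data = toy, var.equal = FALSE)

c(vectors = welch_vectors$p.value, formula = welch_formula$p.value)
```

With the formula form the groups are ordered by the levels of the grouping variable, so the difference is again taken as group 0 minus group 1.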

Repeat the hypothesis test using drnk_dr21 categories with the aggression variable for age 18 (aggress18), and for the difference between the perceived safe number of standard drinks and the legal limit (per_safe21).

Report your findings as above.

4. Logistic Regression, ANOVA

Important Information

This question is designed for students undertaking a first year statistics course at university.

Perform a logistic regression with DWI (drnk_dr21: 0=No, 1=Yes) as the response. Use sex, perceived safe number of drinks (per_safe21) and crash involvement at age 15 (crash15) as the predictors.

Write down the model for the regression of driving while impaired on sex, perceived safe number of drinks to consume (per_safe21) and crash involvement at age 15 (crash15).

State any conclusions you have based on the model estimates and associated chi-squared p-values.

Code
DSCmodel<-glm(drnk_dr21~sex+crash15+per_safe21,data=accidents,family=binomial(link="logit"))
summary(DSCmodel)

Call:
glm(formula = drnk_dr21 ~ sex + crash15 + per_safe21, family = binomial(link = "logit"), 
    data = accidents)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2031  -0.5670  -0.4615  -0.3854   2.3343  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.923201   0.354665  -8.242  < 2e-16 ***
sex          0.776756   0.200155   3.881 0.000104 ***
crash15     -0.007918   0.268532  -0.029 0.976476    
per_safe21   0.125563   0.022234   5.647 1.63e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 864.51  on 1048  degrees of freedom
Residual deviance: 805.86  on 1045  degrees of freedom
  (788 observations deleted due to missingness)
AIC: 813.86

Number of Fisher Scoring iterations: 5
Code
anova(DSCmodel,test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: drnk_dr21

Terms added sequentially (first to last)

           Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                        1048     864.51              
sex         1   23.057      1047     841.45 1.573e-06 ***
crash15     1    0.005      1046     841.45    0.9441    
per_safe21  1   35.586      1045     805.86 2.441e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The regression equation for the fitted model may be written as follows

\[ \log \frac{p}{1-p} = -2.9232 + 0.7768 \times \text{sex} - 0.0079 \times \text{crash15} + 0.1256 \times \text{per\_safe21} \] where \(p\) is the probability of having driven while impaired between the ages of 18 and 21.

Because sex is coded 1 (female) and 2 (male), the odds of driving while impaired for males are \(\exp(0.776756) = 2.1744\) times the odds for females, holding the other predictors fixed. With a p-value of <0.001 this is a significant effect. The difference between the perceived safe number of drinks and the legal limit (per_safe21) is also significantly related to driving while impaired. For each one-unit increase in this difference, the odds of driving while impaired are multiplied by \(\exp(0.125563) = 1.1338\). There is no significant association between having experienced a crash between ages 13 and 15 (crash15) and the log odds of driving while impaired (p = 0.976), so the model should be refitted with this predictor removed.
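The odds ratios quoted above can be reproduced from the printed coefficient table. The sketch below also adds approximate 95% Wald intervals, which are not part of the original solution. The estimates and standard errors are copied from the summary() output, and the object names are illustrative.

```r
# Estimates and standard errors copied from the summary() output above
est <- c(sex = 0.776756, crash15 = -0.007918, per_safe21 = 0.125563)
se  <- c(sex = 0.200155, crash15 = 0.268532, per_safe21 = 0.022234)

# Odds ratios are the exponentiated coefficients
odds_ratio <- exp(est)

# Approximate 95% Wald intervals, computed on the log-odds scale then exponentiated
wald_ci <- exp(cbind(lower = est - qnorm(0.975) * se,
                     upper = est + qnorm(0.975) * se))

round(cbind(odds_ratio, wald_ci), 4)
```

An interval for an odds ratio that contains 1 corresponds to a non-significant Wald test at the 5% level, as is the case for crash15 here.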

The analysis of deviance chi-squared p-values for the sex and per_safe21 variables are much less than 0.05, indicating that each term significantly reduces the residual deviance when it is added to the model (terms are tested sequentially, in the order listed). However, the crash15 variable has a non-significant p-value of 0.9441, so it should be removed from the model in the interests of parsimony (fitting the best model with as few parameters as possible).
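One way to carry out the suggested refit is with update(), followed by a comparison of the two nested models. The sketch below runs on simulated data in a made-up data frame toy, generated with no crash15 effect, so it illustrates the mechanics only and does not reproduce the lesson's results.

```r
# Hypothetical simulated data (made up); the response does not depend on crash15
set.seed(42)
n <- 400
toy <- data.frame(
  sex        = sample(1:2, n, replace = TRUE),
  crash15    = rbinom(n, 1, 0.15),
  per_safe21 = rnorm(n, mean = 0, sd = 3)
)
toy$drnk_dr21 <- rbinom(n, 1, plogis(-2.9 + 0.8 * toy$sex + 0.13 * toy$per_safe21))

# Full model, as in the lesson
full <- glm(drnk_dr21 ~ sex + crash15 + per_safe21, data = toy,
            family = binomial(link = "logit"))

# Refit with crash15 removed
reduced <- update(full, . ~ . - crash15)

# Compare the nested models with a chi-squared test on the change in deviance
comparison <- anova(reduced, full, test = "Chisq")
comparison
```

With the real data, rows that were dropped only because crash15 was missing would re-enter the reduced fit. The two models would then be fitted to different observations, so it may be necessary to restrict both fits to the same complete rows before comparing them.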