11  Epidemiology

This lesson investigates whether childhood circumcision reduces the risk of acquiring genital herpes in men, providing a brief introduction to the field of epidemiology. It is presented by Dr Nigel Dickson (University of Otago Medical School).

Data

Data Summary

1890 observations

9 variables

Variable Type Information
ID Categorical Participant ID, not used for analysis.
CIRCUMCISED Binary 1 (circumcised at age 3), 0 (not circumcised at age 3).
HSV2POS Binary 1 (HSV-2 positive), 0 (HSV-2 negative).
UNDER_10_PART Binary 1 (less than 10 sexual partners), 0 (10+ sexual partners).
SOCIOECONOMIC Categorical 3 levels: 1 (high), 2 (medium), 3 (low).
SAMESEX Binary 1 (yes), 0 (no).
SCHOOLING Categorical 3 levels: 1 (high school or less), 2 (post secondary), 3 (university).
STI15 Binary 1 (sexually transmitted infection in last 5 years), 0 (no sexually transmitted infection in last 5 years).

There are 2 files associated with this presentation. The first contains the data you will need to complete the lesson tasks, and the second contains descriptions of the variables included in the data file.

Video

Objectives

Learning Objectives

New skills and concepts:

  1. Cross-tabulation for 2 and 3 variables.

  2. Confidence intervals, hypothesis tests for proportions and differences in proportions.

  3. Relative risk.

  4. Logistic regression for single predictor.

Reinforcing skills and concepts seen in earlier lessons:

  1. Read and format data.

  2. Experimental design.

Tasks

0. Read and Format data

0a. Read in the data

First check you have installed the package readxl (see Section 2.6) and set the working directory (see Section 2.1), using instructions in Getting started with R.

Load the data into R.

Previous Lesson

To load the data in R we run code analogous to Task 0 in Cockles Section 3.0.1

Code
#loads readxl package
library(readxl) 

#loads the data file and names it herpes
herpes<-read_xls("Herpes data.xls") 

 #view beginning of data frame
head(herpes)
Code
#loads readxl package
library(readxl) 
Warning: package 'readxl' was built under R version 4.2.2
Code
#loads the data file and names it herpes
herpes<-read_xls("Herpes data.xls") 

 #view beginning of data frame
head(herpes)
# A tibble: 6 × 9
     ID CIRCUMCISED HSV2POS UNDER_10_PART NUM_PART SOCIO…¹ SAMESEX SCHOO…²  STI5
  <dbl>       <dbl>   <dbl>         <dbl>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>
1     1           0       0             0       15       2       0       2     0
2     2           1       0             1        6       3       0       1     0
3     3           0       0             1        1       3       0       3     1
4     4           1       0             1        2       3       0       1     0
5     5           0       0             0       15       3       0       2     0
6     6           0       0             0       17       3       0       2     0
# … with abbreviated variable names ¹​SOCIOECONOMIC, ²​SCHOOLING

The data used for this study has been modified for teaching purposes by supposing a larger cohort of men than were originally recorded. 1890 men were followed from birth (1972), with the aim to investigate whether early childhood circumcision (by age 3) reduces the risk of herpes (the original cohort was approximately 1100 men).

The variables recorded are ID, CIRCUMCISED, HSV2POS (positive test for herpes), UNDER_10_PART (Under 10 sexual partners), NUM_PART (total number of sexual partners), SOCIOECONOMIC (socio-economic status), SAMESEX (same sex contact), SCHOOLING and STI5(sexually transmitted infection in the last 5 years). It should be noted that there are many variables recorded in this data set, however it is not necessary to use them all at one time. It is helpful to collect as much information as possible when conducting a study so that different hypotheses can be tested later on.

0b. Format the data

A lot of these variables are actually factors, the numbers represent categories or groups. R has classified them as numeric variables, so we will recode them as factors.

If you have already completed the earlier lessons focusing on continuous data, you will have seen some examples of numeric and character vectors that are actually factor variables. Try to identify and convert these yourself in the herpes data before revealing the code.

Code
herpes$CIRCUMCISED<-as.factor(herpes$CIRCUMCISED)
herpes$UNDER_10_PART<-as.factor(herpes$UNDER_10_PART)
herpes$SOCIOECONOMIC<-as.factor(herpes$SOCIOECONOMIC)

The variables SAMESEX, SCHOOLING and STI5 should also be recoded as factors.

Code
herpes$CIRCUMCISED<-as.factor(herpes$CIRCUMCISED)
herpes$UNDER_10_PART<-as.factor(herpes$UNDER_10_PART)
herpes$SOCIOECONOMIC<-as.factor(herpes$SOCIOECONOMIC)
herpes$CIRCUMCISED<-as.factor(herpes$CIRCUMCISED)
herpes$UNDER_10_PART<-as.factor(herpes$UNDER_10_PART)
herpes$SOCIOECONOMIC<-as.factor(herpes$SOCIOECONOMIC)
herpes$SAMESEX<-as.factor(herpes$SAMESEX)
herpes$SCHOOLING<-as.factor(herpes$SCHOOLING)
herpes$STI5<-as.factor(herpes$STI5)

1. Cross-Tabulation (2 variables)

The first step of the analysis is to get a cross-tabulation of some of the data we are looking at. Create a table of CIRCUMCISED and HSV2POS data.

Calculate the percentage of men that tested HSV2-positive (positive for herpes) for both the circumcised group and the non-circumcised group. What do you see?

Code
#create basic table of counts across the categories of 2 variables
#dnn= c() title vector has 2 entries
#specified in the same order the variables were provided to the table function
CHVtable<-table(herpes$CIRCUMCISED,herpes$HSV2POS,dnn=c("Circumcised","HSV2POS"))

#adds row and column sums
addmargins(CHVtable) 
Code
#create basic table of counts across the categories of 2 variables
#dnn= c() title vector has 2 entries
#specified in the same order the variables were provided to the table function
CHVtable<-table(herpes$CIRCUMCISED,herpes$HSV2POS,dnn=c("Circumcised","HSV2POS"))

#adds row and column sums
addmargins(CHVtable) 
           HSV2POS
Circumcised    0    1  Sum
        0   1022   98 1120
        1    722   48  770
        Sum 1744  146 1890
Code
(98/1120)*100
[1] 8.75
Code
(48/770)*100
[1] 6.233766

8.75% of men in the non-circumcised group tested positive for herpes, slightly more than the 6.23% of men in the circumcised group who tested positive.

2. Confidence Intervals, Hypothesis Tests (proportions, difference in proportions)

2a. Confidence Intervals, Hypothesis Tests (proportions)

Construct 95% confidence intervals for the prevalence (proportions) at age 26 of herpes HVS2POS for both the non-circumcised and circumcised groups.

What are your conclusions from the two confidence intervals and hypothesis tests?

It is most efficient to calculate the confidence intervals in R using the prop.test() function. This is similar to the t.test() function we have previously used for continuous data, the only difference is it is used for proportions.

Code
#first argument is the number of successes (in this case having herpes and not being circumcised)
#second argument is number of trials
#note the use of length to sum the counts
prop.test(length(which(herpes$HSV2POS=="1"&herpes$CIRCUMCISED=="0")),
          n=length(which(herpes$CIRCUMCISED=="0")))

From the output above we have a 95% confidence interval for the true proportion of non-circumcised men who have herpes at age 26. Along with this interval, prop.test() provides a best estimate of the proportion which you can multiply by 100 to check your earlier percentage estimate.

We also have the result of a hypothesis test, where the null hypothesis is that the true proportion of non-circumcise men with herpes at age 26 is equal to 0.5.

It is most efficient to calculate the confidence intervals in R using the prop.test() function. This is similar to the t.test() function used for continuous data, the only difference is it is used for proportions.

Code
#first argument is the number of successes (in this case having herpes and not being circumcised)
#second argument is number of trials
#note the use of length to sum the counts
prop.test(length(which(herpes$HSV2POS=="1"&herpes$CIRCUMCISED=="0")),
          n=length(which(herpes$CIRCUMCISED=="0")))

    1-sample proportions test with continuity correction

data:  length(which(herpes$HSV2POS == "1" & herpes$CIRCUMCISED == "0")) out of length(which(herpes$CIRCUMCISED == "0")), null probability 0.5
X-squared = 760.65, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.07192269 0.10597295
sample estimates:
     p 
0.0875 

We can be 95% confident that the true proportion of non-circumcised men who have herpes at age 26 lies between 0.0719 and 0.1060, with a best estimate of 0.0875 (8.75%).

The corresponding hypothesis test has a p-value of < 2.2e-16, giving us significant evidence to reject the null hypothesis in favour of the alternative and conclude that the true proportion of non-circumcised men with herpes at age 26 is not equal to 0.5 (i.e. chance).

Construct a confidence interval and hypothesis test for the proportion of men with herpes in the circumcised group.

2b. Confidence Interval and Hypothesis Test (difference in proportions)

Construct a 95% confidence interval for the difference in HSV2POS proportions according to the CIRCUMCISED variable.

What can you conclude from the confidence interval and hypothesis test for the difference between the two proportions? Are any of the results that you have found so far significant?

This follows a similar, but slightly more involved procedure as the confidence intervals for proportions.

Code
#First argument is a vector of successes for each of the groups to be compared
#second argument is a vector of the number of trials for each group

prop.test(c(length(which(herpes$HSV2POS=="1"&herpes$CIRCUMCISED=="0")),length(which(herpes$HSV2POS=="1"&herpes$CIRCUMCISED=="1"))),
  n=c(length(which(herpes$CIRCUMCISED=="0")),length(which(herpes$CIRCUMCISED=="1"))))
Code
#First argument is a vector of successes for each of the groups to be compared
#second argument is a vector of the number of trials for each group

prop.test(c(length(which(herpes$HSV2POS=="1"&herpes$CIRCUMCISED=="0")),length(which(herpes$HSV2POS=="1"&herpes$CIRCUMCISED=="1"))),
  n=c(length(which(herpes$CIRCUMCISED=="0")),length(which(herpes$CIRCUMCISED=="1"))))

    2-sample test for equality of proportions with continuity correction

data:  c(length(which(herpes$HSV2POS == "1" & herpes$CIRCUMCISED == "0")), length(which(herpes$HSV2POS == "1" & herpes$CIRCUMCISED == "1"))) out of c(length(which(herpes$CIRCUMCISED == "0")), length(which(herpes$CIRCUMCISED == "1")))
X-squared = 3.7077, df = 1, p-value = 0.05416
alternative hypothesis: two.sided
95 percent confidence interval:
 0.0002870711 0.0500376042
sample estimates:
    prop 1     prop 2 
0.08750000 0.06233766 

We can be 95% confident that the true difference in proportions of non-circumcised and circumcised men with herpes at age 26 lies between 0.0003 and 0.0500, with a best estimate of 0.0252 (2.52%). This confidence interval is entirely (just) above 0, suggesting that the proportion of non-circumcised men with herpes is higher than the proportion of circumcised men.

The corresponding hypothesis test has a p-value of 0.0542, this gives us borderline significant evidence to reject the null hypothesis in favour of the alternative but is still greater than 0.05. From the hypothesis test we would therefore conclude that the true difference in proportions of non-circumcised and circumcised men with herpes at age 26 is not significantly different to 0.

Important Information

Standard error is calculated slightly differently for hypothesis tests (which assume the null is true) and confidence intervals. Therefore, if there is not strong evidence in a particular direction sometimes these tests can reach conflicting conclusions.

In this instance the confidence interval only just excludes 0, and the probability of the observed difference in proportions occurring under the null hypothesis is only just above 0.05. The two tests fall on alternate sides of their imposed cut-offs, representing an edge case where it is difficult to determine if we have just enough evidence in favour of a difference or not quite enough.

3. Relative Risk

Relative risk (RR) is the chance of an event (in this case the development of herpes (HSV2POS)) occurring relative to exposure. I.e. it is a ratio of the probability of an event occurring in the exposed group relative to that in the non-exposed group (here ‘exposure’ is circumcision (CIRCUMCISED=1)).

To calculate the relative risk for herpes in the circumcised group relative to the non-circumcised group we use the table we produced in Task 1. We use the formula:

\[ \frac{p_{\text{circumcised}}}{p_{\text{non-circumcised}}}\]

where \(p_{\text{circumcised}}\) is the probability of testing HSV2-positive in the circumcised group and \(p_{\text{non-circumcised}}\) is the probability of testing HSV2-positive in the non-circumcised group.

Calculate the above relative risk and state your conclusion in relation to the effect of CIRCUMCISED on the likelihood of contracting herpes HSV2POS.

Important Information

The confidence intervals for the relative risks are outside the school syllabus. They are not symmetrical about the point estimate but have an interpretation which is similar to other confidence intervals. The value 1 for a relative risk is the crucial decision value because then the two risks are equal.

Code
pncirc<-(98/1120)
pcirc<-(48/770)

rr<-pcirc/pncirc
rr
[1] 0.7124304

The risk of testing positive for herpes at age 26 in the circumcised group is 0.7124 times the risk of testing positive for herpes in the non-circumcised group. As this value is less than 1 (1 is equal risk) this suggests there is a reduced herpes risk for men who are circumcised, but it is not a dramatic decrease.

To make a definitive conclusion about the relative risk of herpes in circumcised and non-circumcised men a confidence interval would need to be calculated. This accounts for the uncertainty that is inherent in our above estimate due to using a sample rather than the entire population of men aged 26. If the confidence interval included 1 there would be no significant difference in the relative herpes risk, if it was entirely less than 1 we would conclude the risk is lower for circumcised men, if it was entirely greater than 1 (not possible in this case as our estimate, which makes up the mid-point of the interval, is less than 1) we would conclude the risk is higher for circumcised men.

4. Experimental Design

Discuss the importance of a designed (experimental) study as opposed to an observational study.

The discussion should cover adjustment for confounders either by stratification (as is described in the accompanying video) or by constructing a designed (clinical) trial to control confounders.

See video Section 11.1

An experimental study is able to establish a causal relationship between the predictors and response of interest. Observational studies can show a correlation, but it is not possible to conclude causality as this may instead be due to confounding variables.

Confounding variables have an association with both the predictor and response values so can obscure their true relationship.

Adjustment for potential confounders can be carried out by performing separate analysis of the effect of predictor variables on the response for different levels of the confounding variable. This allows us to see the relationship between the predictor and response within similar confounding values, and takes into account the confounding variable to test for a true effect.

In a designed clinical trial participants are randomly assigned to receive either the treatment of interest or the control, so the impact of potential confounders is reduced as they can no longer influence whether a participant receives the treatment or not. As circumcision is a surgical procedure with relatively permanent effects it is not ethical to carry out an experimental study when this is the treatment of interest.

5. Cross-Tabulation (3 variables)

5a. Cross-Tabulation

Stratify the groups by the number of sexual partners. We need a categorical variable for stratification so use UNDER_10_PART (under 10 sexual partners, yes/no).

Add the stratifying binary variable UNDER_10_PART to the contingency table from Task 1. Try this yourself and then reveal the code.

Previous Task

See Task 1 for the 2 variable version of this table.

Code
CHV10table<-table(herpes$CIRCUMCISED,herpes$HSV2POS,herpes$UNDER_10_PART,
                  dnn=c("Circumcised","HSV2POS","Under 10 Partners"))

addmargins(CHV10table)
Code
CHV10table<-table(herpes$CIRCUMCISED,herpes$HSV2POS,herpes$UNDER_10_PART,
                  dnn=c("Circumcised","HSV2POS","Under 10 Partners"))

addmargins(CHV10table)
, , Under 10 Partners = 0

           HSV2POS
Circumcised    0    1  Sum
        0    552   74  626
        1    186   24  210
        Sum  738   98  836

, , Under 10 Partners = 1

           HSV2POS
Circumcised    0    1  Sum
        0    470   24  494
        1    536   24  560
        Sum 1006   48 1054

, , Under 10 Partners = Sum

           HSV2POS
Circumcised    0    1  Sum
        0   1022   98 1120
        1    722   48  770
        Sum 1744  146 1890

5b. Calculate Proportions

Calculate the proportion of men that tested positive for HSV2 for each of the four combinations by hand.

Not Circumcised, ≥ 10 Partners:

Code
74/626

Not Circumcised, ≥ 10 Partners:

Code
74/626
[1] 0.1182109

Not Circumcised, < 10 Partners:

Code
24/494
[1] 0.048583

Circumcised, ≥ 10 Partners:

Code
24/210
[1] 0.1142857

Circumcised, < 10 Partners:

Code
24/560
[1] 0.04285714

There appears to be a higher proportion of men with more than 10 partners testing positive for herpes compared to those with less than 10, across both the circumcised and non-circumcised groups.

6. Practice: Confidence Intervals, Hypothesis Tests (proportions, difference in proportions)

6a. Confidence Intervals, Hypothesis Tests (proportions)

Calculate the confidence intervals for the proportions in Task 5.

Previous Task

This is similar to the calculation in Task 2a. but has a third condition.

Discuss any significant results that you see from the confidence intervals and hypothesis tests for the four proportions.

The calculation of the proportion,confidence interval, and hypothesis test for men with HSV2 that are not circumcised and have had greater than or equal to 10 sexual partners is shown below.

Code
prop.test(length(which(herpes$HSV2POS=="1"&herpes$CIRCUMCISED=="0"&herpes$UNDER_10_PART=="0")),
    n=length(which(herpes$CIRCUMCISED=="0"&herpes$UNDER_10_PART=="0")))

The null hypothesis is that the proportion of non-circumcised men with 10 or more partners who have herpes is equal to 0.5, and the alternative hypothesis that this proportion is not equal to 0.5.

Code
prop.test(length(which(herpes$HSV2POS=="1"&herpes$CIRCUMCISED=="0"&herpes$UNDER_10_PART=="0")),
    n=length(which(herpes$CIRCUMCISED=="0"&herpes$UNDER_10_PART=="0")))

    1-sample proportions test with continuity correction

data:  length(which(herpes$HSV2POS == "1" & herpes$CIRCUMCISED == "0" & herpes$UNDER_10_PART == "0")) out of length(which(herpes$CIRCUMCISED == "0" & herpes$UNDER_10_PART == "0")), null probability 0.5
X-squared = 363.46, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.09449736 0.14672750
sample estimates:
        p 
0.1182109 
Code
prop.test(length(which(herpes$HSV2POS=="1"&herpes$CIRCUMCISED=="1"&herpes$UNDER_10_PART=="0")),
    n=length(which(herpes$CIRCUMCISED=="1"&herpes$UNDER_10_PART=="0")))

    1-sample proportions test with continuity correction

data:  length(which(herpes$HSV2POS == "1" & herpes$CIRCUMCISED == "1" & herpes$UNDER_10_PART == "0")) out of length(which(herpes$CIRCUMCISED == "1" & herpes$UNDER_10_PART == "0")), null probability 0.5
X-squared = 123.43, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.07605235 0.16712714
sample estimates:
        p 
0.1142857 
Code
prop.test(length(which(herpes$HSV2POS=="1"&herpes$CIRCUMCISED=="0"&herpes$UNDER_10_PART=="1")),
    n=length(which(herpes$CIRCUMCISED=="0"&herpes$UNDER_10_PART=="1")))

    1-sample proportions test with continuity correction

data:  length(which(herpes$HSV2POS == "1" & herpes$CIRCUMCISED == "0" & herpes$UNDER_10_PART == "1")) out of length(which(herpes$CIRCUMCISED == "0" & herpes$UNDER_10_PART == "1")), null probability 0.5
X-squared = 400.86, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.03204103 0.07245579
sample estimates:
       p 
0.048583 
Code
prop.test(length(which(herpes$HSV2POS=="1"&herpes$CIRCUMCISED=="1"&herpes$UNDER_10_PART=="1")),
    n=length(which(herpes$CIRCUMCISED=="1"&herpes$UNDER_10_PART=="1")))

    1-sample proportions test with continuity correction

data:  length(which(herpes$HSV2POS == "1" & herpes$CIRCUMCISED == "1" & herpes$UNDER_10_PART == "1")) out of length(which(herpes$CIRCUMCISED == "1" & herpes$UNDER_10_PART == "1")), null probability 0.5
X-squared = 466.29, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.02824317 0.06402501
sample estimates:
         p 
0.04285714 

All proportions are significantly different from chance (0.5).

Previous Task

Confidence intervals can be interpreted in the same way as Task 2a.

6b. Confidence Intervals, Hypothesis Tests (difference in proportions)

Calculate the confidence interval for the difference between proportions.

Previous Task

This is similar to the calculation in Task 2b. but has a third condition.

What conclusions can you draw from your confidence intervals? Discuss any significant (or non-significant) differences that you see and compare these with the results found in Task 2.

What effect has the inclusion of ‘number of sexual partners’ had on the conclusions that were made?

The procedure for the difference between proportions for the circumcised group with \(\geq 10\) sexual partners and the non-circumcised group with \(\geq 10\) sexual partners is shown.

Important Information

The non-circumcised group is group 1 (p1) and the circumcised group is group 2 (p2) to allow direct comparisons with the confidence intervals found in Task 2. The same results will be produced (with opposite signs) and conclusions will not change if this is executed the opposite way.

Code
#for a difference in proportions we supply vectors to the arguments for number of successes and number of trials
prop.test(c(length(which(herpes$HSV2POS=="1"&herpes$CIRCUMCISED=="0"&herpes$UNDER_10_PART=="0")),length(which(herpes$HSV2POS=="1"&herpes$CIRCUMCISED=="1"&herpes$UNDER_10_PART=="0"))),
n=c(length(which(herpes$CIRCUMCISED=="0"&herpes$UNDER_10_PART=="0")),length(which(herpes$CIRCUMCISED=="1"&herpes$UNDER_10_PART=="0"))))

The null hypothesis is that for men with more than 10 partners the difference in the proportion of circumcised and non-circumcised men with herpes is equal to 0, and the alternative hypothesis that this proportion is not equal to 0. The standard error formula for this test uses the pooled variance.

Repeat this procedure for the difference between proportions for the circumcised group with <10 sexual partners and the non-circumcised group with <10 sexual partners.

Code
#for a difference in proportions we supply vectors to the arguments for number of successes and number of trials
prop.test(c(length(which(herpes$HSV2POS=="1"&herpes$CIRCUMCISED=="0"&herpes$UNDER_10_PART=="0")),length(which(herpes$HSV2POS=="1"&herpes$CIRCUMCISED=="1"&herpes$UNDER_10_PART=="0"))),
n=c(length(which(herpes$CIRCUMCISED=="0"&herpes$UNDER_10_PART=="0")),length(which(herpes$CIRCUMCISED=="1"&herpes$UNDER_10_PART=="0"))))

    2-sample test for equality of proportions with continuity correction

data:  c(length(which(herpes$HSV2POS == "1" & herpes$CIRCUMCISED == "0" & herpes$UNDER_10_PART == "0")), length(which(herpes$HSV2POS == "1" & herpes$CIRCUMCISED == "1" & herpes$UNDER_10_PART == "0"))) out of c(length(which(herpes$CIRCUMCISED == "0" & herpes$UNDER_10_PART == "0")), length(which(herpes$CIRCUMCISED == "1" & herpes$UNDER_10_PART == "0")))
X-squared = 0.00084447, df = 1, p-value = 0.9768
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.04916763  0.05701792
sample estimates:
   prop 1    prop 2 
0.1182109 0.1142857 
Code
#for a difference in proportions we supply vectors to the arguments for number of successes and number of trials
prop.test(c(length(which(herpes$HSV2POS=="1"&herpes$CIRCUMCISED=="0"&herpes$UNDER_10_PART=="1")),length(which(herpes$HSV2POS=="1"&herpes$CIRCUMCISED=="1"&herpes$UNDER_10_PART=="1"))),
n=c(length(which(herpes$CIRCUMCISED=="0"&herpes$UNDER_10_PART=="1")),length(which(herpes$CIRCUMCISED=="1"&herpes$UNDER_10_PART=="1"))))

    2-sample test for equality of proportions with continuity correction

data:  c(length(which(herpes$HSV2POS == "1" & herpes$CIRCUMCISED == "0" & herpes$UNDER_10_PART == "1")), length(which(herpes$HSV2POS == "1" & herpes$CIRCUMCISED == "1" & herpes$UNDER_10_PART == "1"))) out of c(length(which(herpes$CIRCUMCISED == "0" & herpes$UNDER_10_PART == "1")), length(which(herpes$CIRCUMCISED == "1" & herpes$UNDER_10_PART == "1")))
X-squared = 0.088153, df = 1, p-value = 0.7665
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.02149373  0.03294544
sample estimates:
    prop 1     prop 2 
0.04858300 0.04285714 
Previous Task

Confidence intervals can be interpreted in the same way as Task 2b.

There are no longer significant differences in the proportion of non-circumcised and circumcised men with herpes at age 26 after stratifying by number of partners. This shows number of partners was a confounding variable (affecting both likelihood of circumcision and likelihood of herpes) that influenced our earlier conclusion.

7. Logistic Regression

Important Information

This question is designed for students undertaking a first year statistics course at university and contains content beyond the school syllabus.

Perform a logistic regression with HSV2POS (herpes positive or negative) as the response and CIRCUMCISED (yes or no) as the predictor.

Write down the model for the regression of circumcision on herpes. State any conclusions you have based on the model estimates and discuss these with reference to the Chi-squared value.

Use the glm() function to fit a generalised linear model in R. By specifying the family as binomial (binary response) with a logit link, we perform a logistic regression.

Code
#fit model using formula response~predictor
CHVmodel<-glm(HSV2POS~CIRCUMCISED,data=herpes,family=binomial(link="logit"))

summary(CHVmodel)

#perform ANOVA test for significance of predictor
anova(CHVmodel,test="Chisq")
Code
#fit model using formula response~predictor
CHVmodel<-glm(HSV2POS~CIRCUMCISED,data=herpes,family=binomial(link="logit"))

summary(CHVmodel)

Call:
glm(formula = HSV2POS ~ CIRCUMCISED, family = binomial(link = "logit"), 
    data = herpes)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.4279  -0.4279  -0.4279  -0.3588   2.3559  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -2.3445     0.1057 -22.171   <2e-16 ***
CIRCUMCISED1  -0.3663     0.1828  -2.004   0.0451 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1028.2  on 1889  degrees of freedom
Residual deviance: 1024.0  on 1888  degrees of freedom
AIC: 1028

Number of Fisher Scoring iterations: 5
Code
#perform ANOVA test for significance of predictor
anova(CHVmodel,test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: HSV2POS

Terms added sequentially (first to last)

            Df Deviance Resid. Df Resid. Dev Pr(>Chi)  
NULL                         1889     1028.2           
CIRCUMCISED  1   4.1474      1888     1024.0   0.0417 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The regression equation for the fitted model may be written as follows

\[ \log \frac{p}{(1-p)} = -2.3445 - 0.3663_{CIRCUMCISED} \] where \(p\) is the probability for males of testing positive for herpes at age 26.

For males the odds of testing positive for herpes if circumcised is decreased by a multiplicative factor of \(exp(- 0.3663) = 0.6933\) compared to non-circumcised men. With a p-value of 0.0451 this is a (just) significant effect.

The analysis of deviance chi-squared value for the CIRCUMCISED variable is 0.0417, indicating a borderline significant increase in deviance if it was removed from the model. Therefore, it has some explanatory power and may indicate a significant effect. However, based on our earlier findings it may be useful to take into account some potentially confounding variables in our model and re-assess our conclusions.

Perform a logistic regression with HSV2POS as the response and CIRCUMCISED + UNDER_10_PART as the predictors.

Write down the second model and state any new conclusions that arise. Discuss the parameter estimates and chi-squared value with reference to the results you found in 7.

Do these results agree with those found earlier in the lesson?