ETC2420 - Tutorial solution (W1-3) CDF; Hypo Testing; Violin Plot

type

Post

Created date

Aug 18, 2021 11:43 PM

Week 1 CDF

How to draw random number with distribution

Poisson :

rpois(n = 5, lambda = 1)

T-dist :

rt(20, df = 10)

ND :

rnorm(1, mean = 3, sd = 2)

Note : the mean of these draws is not exactly the same as declared. So, you cannot guarantee that the mean is exactly 3.

Simulation

Q1 :

Supposed that we want to explore how variable the sample mean of 100 normally distributed samples is. So, first, we Simulate 100 variables. Then Compute the sample mean.

Run the experiment 100,000 times and make a histogram of the sampling distribution of the sample mean.


# 1. Run the experiment 100,000 times
N <- 100000
draws_combined <- tibble(
  experiment = rep(1:N, each = 100),
  draw = rnorm(N * 100, 3, 1)
)
# 2. sampling distribution of the sample mean. 
sampling_distribution <- draws_combined %>% 
         group_by(experiment) %>% 
         summarise(sample_mean = mean(draw))

Step 1, 2


# make a histogram
sampling_distribution %>% ggplot(aes(sample_mean)) +
  geom_histogram(fill = "purple", bins = 30) + 
  theme_bw() +
  xlab("Sample mean")

# The version of making a bell curve.
sampling_distribution %>% ggplot(aes(sample_mean)) +
  geom_histogram(aes(y = after_stat(density)), 
                 fill = "purple",
                 bins = 50) +
  stat_function(fun = function(x) dnorm(x, 3,sqrt(1/100)), 
                colour = "red", linetype = "dashed", size = 1.2) +
  theme_bw() +
  xlab("Sample mean")

Step 3 : Histogram

Q1.1

For the same experiment, plot the sampling distribution of the largest observation in the sample.

sampling_distribution <- draws_combined %>%
group_by(experiment) %>%
summarise(sample_mean = max(draw)) #take the largest observation

#make a histogram

sampling_distribution %>% ggplot(aes(sample_mean)) +
geom_histogram(fill = "purple", bins = 30) + theme_bw() +
xlab("Sample standard deviation")

CDF & eCDF - Visualise more complex uncertainty

Q1 . Test if the data is following a ND, using comparison with CDF

tibble(sample = data) %>% ggplot(aes(x = sample)) +
stat_ecdf() +
stat_function(fun = pnorm, colour = "blue",
linetype = "dashed")

The empirical CDF is not close enough to the theoretical one to say that it's from the same distribution.

Q2. Now, working the same thing for 100 times.

#Data

set.seed(301)
tibble(draw = rt(20,df= 10)) %>%
ggplot(aes(x = draw)) + stat_ecdf() +
stat_function(fun = pnorm, colour = "blue",
linetype = "dashed")

#Visualisation

ggplot(experiments, aes(x = draw)) +
stat_ecdf(aes(group = experiment),
colour = "lightgrey", 
alpha = 10/100) + 
stat_ecdf(data = tibble(real_data = data),
aes(x = real_data)) +
stat_function(fun = pnorm, colour = "blue",
linetype = "dashed") +
theme_classic() +
xlab("x")

Conclusion : What we have shown is that we cannot distinguish between the sample and 20 draws from a standard ND. — Conclusion : What we have shown is that *we cannot distinguish between the sample and 20 draws from a standard ND.*

Notes on visualisation code

Making the empirical CDFs a lighter colour is useful to make them less prominent.

The alpha parameter makes the lines semi-transparent, helping reduce their intensity.

This parameter is difficult to set and you should try a bunch of different values. (alpha = 1 is the default value, and alpha = 0 is fully transparent.)

Q3 . The difference when the experiment has 2000 observations instead of 20

#data

N <- 100
data <- rt(2000, df = 10)
experiments <- tibble(experiment = rep(1:N, each = 2000),
draw = rnorm(N * 2000, 0, 1))

#Visualisation

ggplot(experiments, aes(x = draw)) +
stat_ecdf(aes(group = experiment),
colour = "lightgrey",
alpha = 10/100) +
stat_ecdf(data = tibble(real_data = data),
aes(x = real_data)) +
stat_function(fun = pnorm, colour = "blue",
linetype = "dashed") +
theme_classic() +
xlab("x")

Q4. The difference between 20 samples from a T-distribution with 5 degrees of freedom and a N(0,1) distribution

#data

N <- 100

data <- rt(20, df = 5)
experiments <- tibble(experiment = rep(1:N, each = 20),
draw = rnorm(N * 20, 0, 1))

#Visualisation

ggplot(experiments, aes(x = draw)) + stat_ecdf(aes(group = experiment),
colour = "lightgrey", alpha = 10/100) +
stat_ecdf(data = tibble(real_data = data),
aes(x = real_data)) +
stat_function(fun = pnorm, colour = "blue",
linetype = "dashed") +
theme_classic() +
xlab("x")

Week 2 Hypo testing

Wilcoxon test - nonparametric

If we do not assume anything about the underlying distribution

( that is we do not know if it is a ND ), test if we could reject the null hypo.

null hypo :

1. the diff of marks is a ND;

2. quiz1 and quiz2 shares the same dist ;

3. diff = 0

It turns out p (≤ 0.05), so There is strong evidence that the null hypothesis can be rejected.

That is, quiz1 and quiz2 does not share the same dist.

Shapiro Wilks test - Test if the population is ND

So, if the result of shapiro.test ≤ 0.5 , we assume it is ND and can use T-test(), else Wilcoxon.Test()

Q1 . Overall test.

Although her full time work is focused on strapping ice skates to tiny kittens, she is pretty good at statistics and immediately sees a problem:

she doesn’t know what type of null distributions to check against. So she decides to check 3 diﬀerent distributions that all have the same mean.

Both test scores come independently from N(50, 252) distributions

Both test scores come independently from Uniform(10, 90) distributions

Both test scores come independently from Exp(1/50) distributions

Her reasoning is that the ﬁrst one is the best case for the tests, the second corresponds to students who are at all levels, and the third corresponds to more students getting low marks than high marks.

Q1.1 : Make a new function hypothesis_test() that implements Ambrose's proposal.


hypothesis_test <- function(x) {
  shapiro <- shapiro.test(x)
  is_normal <- (shapiro$p.value > 0.05)
  
  if(is_normal) {
    test <- t.test(x)
  } else {
    test <- wilcox.test(x)
  }
  return(test$p.value < 0.05)
}

Q1.2 Assess the type 1 error for Ambrose's procedure for each of Eunice's null distributions and class sizes of $n=10, 100, 400$.


# make a method called type1_prob 
type1_prob <- function(n, n_simulations = 10000){
    experiments <- tibble(
    experiment = rep(1:n_simulations, each = n),
    quiz1 = rnorm(n * n_simulations, mean = 50, sd = 25),
    quiz2 = rnorm(n * n_simulations, mean = 50, sd = 25),
    difference = quiz2 - quiz1
  )

  tests <- experiments %>% group_by(experiment) %>%
    summarise(reject = hypothesis_test(difference))
  
  return(mean(tests$reject))
}
#Check the null hypo of the first one : 
	# Both test scores come independently from N(50, 252) distributions
tibble(n = c(10, 100, 400), map_dbl(n, type1_prob))

Test the 2nd and 3rd


unif_quiz <- function(n) {
  return(runif(n, 10, 90))
}

exp_quiz <- function(n) {
  return(rexp(n, 1 / 50))
}

type1_prob <- function(n, rand_quiz, n_simulations = 10000){
    experiments <- tibble(
    experiment = rep(1:n_simulations, each = n),
    quiz1 = rand_quiz(n * n_simulations),
    quiz2 = rand_quiz(n * n_simulations),
    difference = quiz2 - quiz1
  )

  tests <- experiments %>% group_by(experiment) %>%
    summarise(reject = hypothesis_test(difference))
  
  return(mean(tests$reject))
}

#Check the null hypo of the second one : 
tibble(n = c(10, 100, 400), map_dbl(n, ~type1_prob(.x, unif_quiz)))
#Check the null hypo of the third one : 
tibble(n = c(10, 100, 400), map_dbl(n, ~type1_prob(.x, exp_quiz)))

Week 3

Violin plot

plot_data %>% ggplot(aes(x = score, y = test)) + geom_violin()

Violin p = boxplot + histogram

Q1 : Are there any apparent similarities or differences between the two intervals or the appropriateness of the methods (T.test and Bootstrap) for this application?

plot a histogram of the diff

cbt %>% ggplot(aes(x = diff)) + geom_histogram(aes(y = after_stat(density))) + geom_density(colour = "blue", linetype = "dashed")

CLT confidence interval

For the t.test

Compute the traditional 95% confidence interval for $\mu$ assuming it is (approximately) normal^[This is often called the "CLT confidence interval" because by the central limit theorem, the sampling distribution for $\mu$ is approximately Gaussian (aka normal).

clt_ci <- t.test(cbt$diff)$conf.int

For the Bootstrap

Generate $5000$ Bootstrap samples.

Do the bootstrap

bootstrap <- tibble(experiment = rep(1:5000, each = 60),
index = sample(1:60, size = 60*5000, replace = TRUE),
ystar = cbt$diff[index])

Compute and report the 95% Bootstrap confidence interval for $\mu$.

y_bar <- mean(cbt$diff)

bias <- bootstrap %>% group_by(experiment) %>%
summarise(delta_star = y_bar - mean(ystar))

boot_ci <- y_bar + quantile(bias$delta_star, c(0.025, 0.975))

All in all,

Plot the density estimate (overlayed histogram and kernel density plot) corresponding to the empirical Bootstrap distribution (that is the distribution of y_bar + delta_star) and the position of the 2.5% and 97.5% quantiles of this distribution. In addition, indicate on the graph each of the following items:

the position of the sample mean shown by a solid vertical line,

the lower 2.5% and 97.5% quantiles of the empirical Bootstrap distribution as vertical dashed lines,

the lower and upper end points of the CLT-based confidence interval (from part i) as vertical dotted lines,

an appropriate title, and

appropriately modified axis labels.


colours <- ggthemes::colorblind_pal()(4)

bias %>% ggplot(aes(x = delta_star + y_bar)) +
  geom_histogram(aes(y = after_stat(density)),
                 fill = colours[1],
                 alpha = 0.6) +
  geom_density(colour = "blue", linetype = "dashed") +
  geom_vline(xintercept = y_bar, colour = colours[2]) +
  geom_vline(xintercept = boot_ci[1], linetype = "solid",
             colour = "Red") +
  geom_vline(xintercept = boot_ci[2], linetype = "solid",
             colour = "Red") +
  geom_vline(xintercept = clt_ci[1], linetype = "dotdash",
             colour = "Green") +
  geom_vline(xintercept = clt_ci[2], linetype = "dotdash",
             colour = "Green") +
  ggtitle("The estimated distribution of the difference") +
  xlab("Average difference between scores")

Q2 .

Compute the bootstrap confidence interval for the population skewness for

A Gaussian distribution, which has skewness zero (regardless of the mean and variance parameters)

An Exponential distribution, which has skewness 2 (regardless of the rate parameter)

Write a function that computes the coverage for the bootstrap confidence intervals. Ideally it should let you specify:

The size of the sample

The distribution the sample is drawn from

The the true value of the skewness

The number of bootstrap simulations

Note 1 : For both distributions, compute the confidence intervals for $n=10$ and $n= 100$ and comment on the coverage. (Hint: Expect it to be ok for one of them and bad for the other)

Make a function of skewness

skewness <- function(y) {
numerator <- mean((y - mean(y))^3)
denom <- (mean((y - mean(y))^2))^(3/2)
return(numerator / denom)
}

Is it in the confident interval?


is_in_ci <- function(n, simulate = function(n) rexp(n,1),
skew_true = 2, n_sim = 1000) {
y <- simulate(n)
bs <- tibble(experiment = rep(1:n_sim,
each = n),
ind = sample(1:n, n*n_sim, replace = TRUE),
ystar = y[ind])
bias <- bs %>%
group_by(experiment) %>%
summarise(delta = skewness(y) - skewness(ystar))
int <- skewness(y) + quantile(bias$delta, c(0.025, 0.975))
return((skew_true > int[1]) & (skew_true < int[2]))
}

Let's just do this for n = 10 and the exponential.

is_in = map_dbl(1:1000, ~is_in_ci(10))

mean(is_in)

I get about 0.56 - this is bad!

It turns out that this measure of skewness can be quite biased when the distribution is skewed, which doesn't actually make it very useful! This combined with the small sample size makes the bootstrap fail.

The gaussian case is a bit better, with 88% coverage for n=10 and

92% coverage for n = 100. But it's still short of the true coverage!

It turns out that skewness is moderately difficult to compute!

Week 5

Code


---
title: "Statistical Thinking Tutorial"
author: "Week 6"
date: "29/08/2021"
output:
  pdf_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## What's on deck for this week

This week we are going to look at censored data. We will

- A quick check in with QQ-plots
- Learn how to fit survival data with the `survival` package 
- A closer look at the Kaplan-Meier curve 
- A better look at medians
- Fitting parametric survival models


# An amuse-bouche 


a. Simulate samples of size $n=$ 30, 100, 500 from these distributions
    i. Lognormal(4, 2)
    ii. Gamma(3, 3)
b. Make a QQ-plot of each these samples comparing to . Explain how closely the samples, of different sizes, appears to match the theoretical distribution.
c. By drawing 100 samples of size 30 from the Gamma(3,3), visualise how variable the QQ-plot can be.

## Dealing with right-censored data in R

Right censored data is a common type of "time to event" data where the event
is only observed for some of the observations. An example would be when the 
event is a death due to acute myelogenous leukemia. Some patients in the sample
will die before the end of the study, while some (hopefully most!) will still
have not died when the study ends.

In class, we talked about how important^[We talked about the statistical reasons
for doing this, but when dealing with serious data about death, we are bound both
morally and ethically to use our data as best we can.] it is to correctly account 
for both the events that we have observed and the ones that will occur, 
unobserved, in the future.

This data is so common that there is a packing in R that deals with it 
_that you you don't even have to install_. It's part of the basic R distribution.
This package is, creatively, names `survival`^[Survival can deal with more 
complex censoring than just right censoring. It does a bunch of stuff that 
we are not going to look at in this class. It's a real industrial-strength thing.].

In order to deal with survival data, we need to keep track of two things:

1. The observed time
2. An indicator of whether this was an event or if we just stopped measuring (a 
censoring). You can check the help for other options, but this is usally coded as 
a `1` if it's an event and a `0` if it is not.

We can see this data structure in the `aml` dataset (the aforementioned leukemia) 
from the `survival` package.

```{r, echo = TRUE, eval = TRUE, message = FALSE}
library(survival)
library(tidyverse)
head(aml)
dim(aml)
```

This data contains an extra factor (`x`) that indicates what type of chemotherapy
regime was used. We are going to use that to filter the data into subpopulations.

One of the key functions in the survival package is `Surv`, which lets us
put the time and event information together.

```{r, echo = TRUE, eval = TRUE}
Surv(aml$time, aml$status) %>% head()
## If you don't understand it, you can view on the next code
view(cbind(aml, Surv(aml$time, aml$status)))
```

Notice that this separates the times into numbers (9, 13) and cenored times (13+), 
where the + indicates that the event occurs at some point after 13.

## Making a survival curve

To fit survival data, we use the `survival::survfit()` function.

```{r, echo = TRUE, eval = TRUE}
fit <- survfit(Surv(time, status) ~ 1, data = aml)
```

Let's break that down:

- The `Surv(time, status)` term is there to link the time to event with the event
indicator
- The tilde (`~`) is a special character in R that we will see more of when we start
moving into regression, but it exists to separate the data (the left of `~`) from 
the "predictors" on the right. For today, we are just using `~ 1`, which says "there
are no predictors".
- The `data = aml` tells the function which data to fit the survival curve on (and
where to look for the columns `time` and `status`)

The fit object contains a bunch of stuff, but the easiest thing we can do is look
at what it prints out as a default
```{r, echo = TRUE, eval = TRUE}
fit
```

This gives the number of observations, the number of observed events and an estimate
with 95% confidence interval for the median survival time. We could've also gotten
this median with the commnad 

```{r, echo = TRUE, eval = TRUE}
median(Surv(aml$time, aml$status))
```

(The `50` is labelling the median as the 50% quantile. Is it terrible output? Yes.)

We can use the `fit` object to plot the Kaplan-Meier estimate of the survival 
curve. There are a few ways to do this, but we will use the `survminer::ggsurvplot()`
function. To use this, you will need to install the `survminer` package.

```{r, echo = TRUE, eval = FALSE}
install.packages("survminer")
```

We can then make the plot.

```{r echo = TRUE, eval = TRUE}
library(survminer)
ggsurvplot(fit)
```
By default, this plots a confidence interval and you can turn this off with the 
option `conf.int = FALSE`.


### Question
1. You can plot the Kaplan-Meier curves for the sub-populations defined by the 
covariate `x`, by replacing `~ 1` with `~ x` in the `survfit` call. Look at the 
resulting medians and Kaplan-Meier curves (use the same call to `ggsurvplot`). 
Write a few sentences commenting on the similarities and differences.

## A closer look at the Kaplan-Meier curve

So how did `ggsurvplot` make that plot? If we look a little closer at the `fit` 
object, we can work it out .

```{r echo = TRUE, eval  =TRUE}
glimpse(fit)

```

These are all the things that are computed when you call `survfit`.

- `time` is the set of unique times in the data ($t_j$ in the lectures)
- `n.event` is the number of events that were observed at each time ($d_j$ in 
the lectures)
- `n.censor` is the number of censored observations at each time
- `n.risk` is the number at risk at each time ($n_j$ in the lectures)
- `surv` is the Kaplan-Meier estimate of the survival time.

We can check that we can compute `surv` from `n.event` and `n.risk`:

```{r, echo = TRUE, eval = TRUE}
cumprod(1 - fit$n.event / fit$n.risk) - fit$surv
```

### Questions
- What does it mean that the above expression is (almost) zero?
- Why isn't it _exactly_ zero?
- What does `cumprod` do? (Also look up what `cumsum` does)
- How can you compute `n.censor` from this data?

## Computing the median

Finally, we can use the estimated survival curve to find the median. This is not
unique except for in the very unusual situation where  `fit$surv` is exactly `0.5`
at one time.  When there is no unique median, we have a number of options, two
popular ones are:

- Select the average of the two closest times with `fit$surv` on either side of 0.5
- Select the time with `fit$surv` closest to 0.5.

For big enough data sets, both of these should give about the same value, but 
for small data sets the second option (which is what R uses) is slightly more reliable
as it's possible that one of the two closest times has a survival estimate that is quite
far from 0.5.

Let's look at this!
```{r, echo = TRUE, eval = TRUE}
summary(fit)
```
The "closest" median is `27`, while the "average" median is `r (23+27)/2`.

### Questions

- Write functions to find the median using both methods. They should take a `fit` 
as an input.

- Consider the `aml` data when `x == "Maintained"`. Compute the Kaplan-Meier curve
for this subset of the data. Which of the two medians is most appropriate? 

- Repeat the exercise for `x == Nonmaintained`.

- Look at the `diabetic` data (load it with `data(diabetic)`). Why is the median 
of this data not defined?

## Fitting a parametric survival distribution

So far, we have only talked about non-parametric estimates of the survival curve.
What if we think that the uncensored data came from a particular distribution? 
Not every distribution can be used to fit survival data in R, but three of the 
most useful options are

- Exponential
- Weibull (_WARNING_ the parameterisation of the Weibull that `survival` uses is different to the standard one in R. Don't even ask.)
- Log-normal.

We fit these models using the `survival::survreg` function

```{r, echo = TRUE, eval = TRUE}
fit2 <- survreg(Surv(time, status) ~ 1, data = aml, dist = "exponential")
fit2
```
What do the paraemters mean? The `(Intercept)` term is the _log_ mean of the survival 
distribution (in this case the exponential). So the single unknown parameter in the
exponential distribution can be computed as

```{r eval = TRUE, echo = TRUE}

lambda <- exp(-coef(fit2))

```

We can then look up what the median of an exponentially distributed random variable is.

![]("exponential.png")

Using this we can compare the parametric estimate of the median to the Kaplan-Meier
estimate.
```{r, eval = TRUE, echo = TRUE}
log(2) / lambda
median(Surv(aml$time, aml$status))$quantile
```

We can also compare the survival curves.

```{r, echo = TRUE, eval = TRUE}
km_curve <- ggsurvplot(fit, conf.int = FALSE)
km_curve$plot + stat_function(fun = pexp, 
                              colour = "blue", 
                              line_type = "dashed",
                              args = list(rate = lambda,
                                          lower.tail = FALSE)
                              )
```

The resulting survival curve is a prettty good match with the empirical estimate
for early times, but after the median, the two estimates diverge wildly.

### Question

- Fit the data using a Weibull distribution. And compare the median and the survival curve and comment on your findings.

The following code will be useful for translating between the `survival` package's
parameterisation of the Weibull distribution and the standard R parameterisation.

```{r, echo = TRUE, eval = FALSE}
## fit3 is the weibull fit

pweibull_shape <- 1 / fit3$scale
pweibull_scale <- exp(coef(fit3))
```