Created date
Sep 27, 2021 11:43 PM
Data Science
Machine Learning
Artificial Intelligence
TOC by week
Tutorial solution week 2
--- title: "Statistical Thinking: Week 2 Lab" date: output: pdf_document: default html_document: df_print: paged word_document: default --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) library(tinytex) ``` # Introduction In this week's lab, we are going to look at how to build synthetic experiments. The main reson for doing this is it let's us see how the sample size of and experiment can effect the variability of the objects of interest. We are going to learn the programming pattern for doing a simualtion study and apply it in a few situations. ### Learning Goals for Week 1 - Learn how to simulate random numbers in R - Learn how to simulate an experiment in tidy form - Identify the random variation in an estimator - Explore the randomness graphically. # Random numbers on a computer Computers, by and large, cannot produce a truly random sequence of numbers^[The wikipedia page on this goes into some detail https://en.wikipedia.org/wiki/Random_number_generation]. Instead, they produce things called _pseudo-random number sequences_, which are sequences of numbers that, for all intents and purposes, behave like random numbers. While a description of how these work is well beyond the scope of this course, we do need to know that the numbers are not truly random. And this is not a bad thing. Because the numbers aren't truly random, we can make our code _reproduce_ the same results at a later time. (We would not be able to do this if the results were truly random!) The key concept here is that of a _seed_, which is an integer that essentially starts the sequence of pseudo-random numbers. In R, we can manipulate the seed with the `set.seed()` function, which accepts an integer argument. You can use any integer you want. (In other applications, there can be security implications to using a short or common key, but that's not important here!) So let's demonstrate this. I'm going to set my seed, and then draw some random numbers (in this case 5 poissons and one normal). ```{r} set.seed(30127) rpois(n = 5, lambda = 1) rnorm(1, mean = 0, sd = 2) ``` Now if I re-set my seed to the same number and re-run those commands _in the same order_ I should get the same sequence of random numbers. ```{r} set.seed(30127) rpois(n = 5, lambda = 1) rnorm(1, mean = 0, sd = 2) ``` _An important caveat:_ You need to run all of the commands in exactly the same order to get the same sequence of numbers. This is because each time a random number is needed we are taken one more step along the pseudorandom sequence, so if we miss one of the functions we may miss one of the steps! # A simple experiment Imagine I have 100 random numbers from a normal distribuiton with mean $3$ and standard deviation $1$. I can use R to get a realisation of this experiment. ```{r} set.seed(30127) draws <- rnorm(n = 100, mean = 3, sd = 1) ``` ### Questions: 1. What is the mean of these draws? Is it exactly 3? ```{r, echo = FALSE} mean(draws) ``` 2. Repeat the "experiment" (in this case, the experiment is drawing 100 normal variables with mean 3 and standard devation 1) four more times and look at how much the sample mean changes. # Repeating the experiment If we want to explore how variable the sample mean of 100 normally distributed samples is, we are going to need to do the following two things a lot of times: 1. Simulate 100 $N(3,1)$ variables 2. Compute the sample mean If we were only going to do this a handful of times (like above), we could do it manually. But If I want to repeat the experiment a thousand times, I'm going to need to tell R to do it. We are going to have to do things like this a lot during the course, so I want tyou to pay close attention to the _programming pattern_ we use to perform this task. ## Method: Enumerate the experiments One way we could simulate 3 experiments manually would be to write code like ```{r} set.seed(30127) draws1 <- rnorm(100, 3, 1) draws2 <- rnorm(100, 3, 1) draws3 <- rnorm(100, 3, 1) ``` If you run this code, you will see that all of the numbers are different and they will have different means. But this is very labour intensive! The problem is the choice to use the name of the variable to encode which replicate of the experiment we are doing. Instead of doing this, we can instead expand our data so that instead of just recording the random number we now record the pair `(experiment, random_number)`. ```{r, message =FALSE} library(tidyverse) set.seed(30127) draws_1 <- tibble(experiment = rep(1, times = 100), draw = rnorm(100, 3, 1)) draws_2 <- tibble(experiment = rep(2, times = 100), draw = rnorm(100, 3, 1)) draws_3 <- tibble(experiment = rep(3, times = 100), draw = rnorm(100, 3, 1)) ``` (The function `rep(1, times = 100)` repeats the number 1 one hundred times.) This is quite silly at the moment because now I have the experiment number in two different places: in the actual data (good!) and in the variable name (bad!). So let's put this together in one variable. (To do this, I'm going to have to do something fancy with `rep()`) ```{r} set.seed(30127) draws_combined <- tibble( experiment = rep(1:3, each = 100), draw = rnorm(3 * 100, 3, 1) ) ``` The `rep()` function is pretty powerful. When you give it a sequence of numbers (`1:3` expands to `1, 2, 3`) you have two options: - `times = n`: Repeats the whole sequence `n` times. `r rep(1:3,times =3)` - `each = n`: Repeats each elemnt in the sequence `n` times `r rep(1:3,each =3)` The nice thing about this pattern is that I can use it to generate _any number_ of experiments! I then need to compute the mean of the draws in each experiment. This is easier to do with the manual version as I could just write ```{r} mean(draws1) ``` So what is the equivalent here? We need to use the `group_by`/`summarise` pattern from the `dplyr` package (part of tidyverse!). This does kinda what you'd expect: it first groups the data by a variable (in this case we would choose experiment so that each row from the same experiment is in the same group) and then we can use `summarise` to compute summaries of the groups. ```{r} sampling_distribution <- draws_combined %>% group_by(experiment) %>% summarise(sample_mean = mean(draw)) ``` We read this as: - Take the tibble `draws_combined` and - Group it by experiment - Make a new variable called `sample_mean` that is the sample mean of each experiment. ### Questions 3. Run the experiment 100,000 times and make a histogram of the sampling distribution^[The technical term of the distribution of a statistic (like the mean) under repeated sampling is the _sampling distribution_.] of the sample mean. ```{r, echo = FALSE, fig.show='hide'} set.seed(30127) N <- 100000 draws_combined <- tibble( experiment = rep(1:N, each = 100), draw = rnorm(N * 100, 3, 1) ) sampling_distribution <- draws_combined %>% group_by(experiment) %>% summarise(sample_mean = mean(draw)) sampling_distribution %>% ggplot(aes(sample_mean)) + geom_histogram(fill = "purple", bins = 30) + theme_bw() + xlab("Sample mean") ``` 4. It turns out that you can compute the sampling distribution of the sample mean exactly: it is normally distributed with mean `3` and variance `1/100`. Overlay the empirical histogram with the theoretical density. (See last week's bonus video for how to do this!) ```{r, echo = FALSE, fig.show='hide'} sampling_distribution %>% ggplot(aes(sample_mean)) + geom_histogram(aes(y = after_stat(density)), fill = "purple", bins = 50) + stat_function(fun = function(x) dnorm(x, 3,sqrt(1/100)), colour = "red", linetype = "dashed", size = 1.2) + theme_bw() + xlab("Sample mean") ``` 5. For the same experiment, plot the sampling distribution of the standard deviation. ```{r, echo = FALSE, fig.show='hide'} sampling_distribution <- draws_combined %>% group_by(experiment) %>% summarise(sample_mean = sd(draw)) sampling_distribution %>% ggplot(aes(sample_mean)) + geom_histogram(fill = "purple", bins = 30) + theme_bw() + xlab("Sample standard deviation") ``` 6. For the same experiment, plot the sampling distribution of the largest observation in the sample. ```{r, echo = FALSE, fig.show='hide'} sampling_distribution <- draws_combined %>% group_by(experiment) %>% summarise(sample_mean = max(draw)) sampling_distribution %>% ggplot(aes(sample_mean)) + geom_histogram(fill = "purple", bins = 30) + theme_bw() + xlab("Sample standard deviation") ``` # Using the pattern to visualise more complex uncertainty Sometimes, we want to know if a sample has a particular distribution. An example of this would be when we do linear regression and we want to check if the residuals are normally distributed. While there are some formal statistical tests for doing this, the most useful techniques are graphical. So we are going to look at one of those today and show how we can extend the idea of the sampling distribution that we developed in the previous example to graphs. ## Scenario: We have a sample that is supposed to be normal. ```{r} data <- rt(20, df = 10) ``` We have some data that we have observed. (In this case, it's from a T distribution with 10 degrees of freedom, but pretend we don't know that.) We were told that this data should be $N(0,1)$. How can we check? One choice would be to plot a histogram and compare it against the theoretical density, but it turns out to be more useful to compare the empircal cumulative density function (eCDF) of the sample to the theoretical CDF of the standard normal distribution^[This comparision is conceptually similar to the qq-plot that you may have seen in a different class.] ```{r} tibble(sample = data) %>% ggplot(aes(x = sample)) + stat_ecdf() + stat_function(fun = pnorm, colour = "blue", linetype = "dashed") ``` This is all well and good, but is the empirical CDF close enough to the theoretical one to say that it's from the same distribution? To answer that, we need to assess how variable and emiprical CDF can be when it's computed based on 20 samples. ## Repeating the process So. We need to assess the variability in an empirical CDF based on 20 samples. How do we do this? We follow the pattern above and make _multiple_ data sets with 20 points that come from a $N(0,1)$ distribution! And then we plot their empirical CDFs! ```{r} set.seed(30127) N <- 100 experiments <- tibble(experiment = rep(1:N, each = 20), draw = rnorm(N * 20, 0, 1)) ggplot(experiments, aes(x = draw)) + stat_ecdf() + stat_function(fun = pnorm, colour = "blue", linetype = "dashed") ``` Oh no! That does not look right! Or, to be more precise, that _looks too good_! Nothing on that plot looks like the empircal CDF from 20 samples. So we've done something wrong. What we accidentally did is plot the empircal CDF of 2000 (20 x 100) samples! This should be an identical graph: ```{r} set.seed(30127) tibble(draw = rnorm(2000)) %>% ggplot(aes(x = draw)) + stat_ecdf() + stat_function(fun = pnorm, colour = "blue", linetype = "dashed") ``` (Those two graphs really are the same because we used the same seed and generated the same number of normals!) So what did we forget to do? We forgot to _group_ by experiment. Previously, we did this with the `group_by`/`summarise` pattern. For plots, we instead add a `group` aesthetic. ```{r} N <- 100 experiments <- tibble(experiment = rep(1:N, each = 20), draw = rnorm(N * 20, 0, 1)) ggplot(experiments, aes(x = draw)) + stat_ecdf(aes(group = experiment)) + stat_function(fun = pnorm, colour = "blue", linetype = "dashed") ``` Here we only needed `group = experiment` for the `stat_ecdf` layer, so we put it in that part of the ggplot call. It would still work in the global aesthetic but it would have thrown a warning because the `group` aesthetic doesn't make sense in `stat_function()`. So this is ... ok. But it's very ugly. The problem is that we have plotted 101 lines on top of each other on this plot and it is visually very messy. We can fix this with two tricks: colour and transparancy. ```{r} ggplot(experiments, aes(x = draw)) + stat_ecdf(aes(group = experiment), colour = "lightgrey", alpha = 10/100) + stat_function(fun = pnorm, colour = "blue", linetype = "dashed") + theme_classic() + xlab("x") ``` This looks a lot better: - Removing the grey background and the grid with `theme_classic()` reduces the visual clutter. - Making the empirical CDFs a lighter colour is useful to make them less prominent. - The `alpha` parameter makes the lines semi-transparent. This helps reduce their intensity. This parameter is difficult to set and you should try a bunch of different values. (`alpha = 1` is the default value, and `alpha = 0` is fully transparent.) Why try to de-emphasise the empircal CDFs? Because we don't care aboout the individual lines. We only care about their variability! Because this gives us a visual _window_ where the deviation from the theoretical CDF is purely due to the small sample size. If the empirical CDF of the sample goes outside this window, we are seeing something qualitatively different from our 100 draws. Finally, let's add in the sample. To do this, we need to add a new layer that is an empirical CDF of the actual data we care about. This layer needs to have _different data_ to the rest of the plot, and we can do that by adding a `data = ...` argument to the `stat_ecdf()` layer. We also need to define an aesthetic in this layer so that ggplot knows what to put on the x-axis. As always, ggplot build the plot in the order it's specified. So it builds the basic axes, then it adds the semi-transparent theoretical empirical CDFs, then it adds the data empirical CDF, and finally it adds the theoretical CDF. ```{r} ggplot(experiments, aes(x = draw)) + stat_ecdf(aes(group = experiment), colour = "lightgrey", alpha = 10/100) + stat_ecdf(data = tibble(real_data = data), aes(x = real_data)) + stat_function(fun = pnorm, colour = "blue", linetype = "dashed") + theme_classic() + xlab("x") ``` In this case, the data empirical CDF does not seem any stranger than the empirical CDFs build from 20 samples from the true distribution. So we cannot say from this plot that the data does not come from a N(0,1) distribution. But there is an _important thing to notice_: We also haven't proved that the does come from a N(0,1) distribution! In fact, we know that it does not! What we have shown is that _we cannot distinguish between the sample and 20 draws from a standard normal distribution_. ### Questions 7. Can you tell the difference when the experiment has 2000 observations instead of 20. (This will take a little longer to run!) ```{r, eval = FALSE, echo = FALSE} set.seed(30127) N <- 100 data <- rt(2000, df = 10) experiments <- tibble(experiment = rep(1:N, each = 2000), draw = rnorm(N * 2000, 0, 1)) ggplot(experiments, aes(x = draw)) + stat_ecdf(aes(group = experiment), colour = "lightgrey", alpha = 10/100) + stat_ecdf(data = tibble(real_data = data), aes(x = real_data)) + stat_function(fun = pnorm, colour = "blue", linetype = "dashed") + theme_classic() + xlab("x") ``` 8. Can you tell the difference between 20 samples from a T-distribution with 5 degrees of freedom and a N(0,1) distribution? ```{r, eval = FALSE, echo = FALSE} set.seed(30127) N <- 100 data <- rt(20, df = 5) experiments <- tibble(experiment = rep(1:N, each = 20), draw = rnorm(N * 20, 0, 1)) ggplot(experiments, aes(x = draw)) + stat_ecdf(aes(group = experiment), colour = "lightgrey", alpha = 10/100) + stat_ecdf(data = tibble(real_data = data), aes(x = real_data)) + stat_function(fun = pnorm, colour = "blue", linetype = "dashed") + theme_classic() + xlab("x") ```
Tutorial solution week 3
--- title: 'Statistical Thinking: Week 3 Lab' date: output: pdf_document: default html_document: default --- ## Overview This week we are going to look closer at evaluating the type 1 error of complex statistical procedures. ## Learning aims - Write small functions in R - Use simulation to evaluate type1 error - Think about how tests work and whether or not they can be composed ## Meet Eunice, an academic cat trainer Eunice is in a pickle. She is a new professor at Monash's new Faculty of Circus Arts and she has been tasked to develop a new unit: _CTC2420: Catistical Rinking_. Eunice put a lot of work into developing this unit, which focuses on methods to train kittens to ice skate, and is really excited to teach her first group of students. Tragically, the semester does not go well. The marks for the first assessment, which is a quiz in week 3, are _terrible_ and Eunice needs to re-assess her teaching strategy. Thankfully, Eunice is a professional and she re-builds the next part of the course based on feedback from students and colleague. She has a second quiz planned for week 6, but wants a good method for assessing if students were improving. ## Eunice's quandry Both quizzes are marked out of 100, so Eunice figures that she can look at the difference between their mark on each student's mark on the second exam and their mark on the first exam and test to see if this deviates from zero. Unfortunately, while Eunice is a world-class cat trainer and has written several excellent books about the history and practice of ice-skating cats, she only ever did one statistics unit in undergrad and that was a long time ago. She remembers two things: 1. If differences in marks are approximately normally distributed, she can test this hypotheses using a _T-Test_^[You should have heard of this one! The test statistic to test $H_0: \mu = 0$ is $$ \frac{\bar{y}}{s_y}, $$ where $s_y$ is the sample standard deviation. (You don't need to know this test statistic!)], which has the R function `t.test`. 1. If the differences in marks are not normally distributed, she can use a _Wilcoxon signed rank test_^[I'd be surprised if you've heard of this test but it is a real thing that is used when you can't assume normality. The test statistic is a little bit more complicated than the T-test. Technically, the Wilcoxon signed rank test tests the null hypothesis that the differences come from a continuous symmetric distribution, which will be true if each exam mark comes iid from the same distribution, which is a reasonable null hypothesis in this context!], which has the R function `wilcoxon.test`. ```{r, echo = FALSE, message = FALSE} library(tidyverse) set.seed(30127) n <- 418 marks <- tibble(student = 1:n, quiz1 = (rgamma(n, shape = 0.5, rate = 1/200) %% 100) + runif(n) - 1 , quiz2 =(rgamma(n, shape = 1, rate = 1/200) %% 100) + runif(n) - 1 ) save(marks, file = "eunice.RData") ``` ### Questions 1. Perform a T-test and a Wilcoxon test on this data. What are you conclusions? 2. Which test do you think is more reliable in this scenario? ```{r echo = FALSE, eval = FALSE} marks <- mutate(marks, difference = quiz2 - quiz1) t.test(marks$difference) wilcox.test(marks$difference) ``` ```{r,echo = FALSE} # Both provide very strong evidence against the mean. # If you look at the histogram of the difference it has # some non-normality in the qqolot, which suggests that # wilcoxon might be more appropriate. ``` ## Ambrose has a terrible idea Worried about the future, Eunice decides to talk to Ambrose. Ambrose is another professor in the Monash Faculty of Circus Arts. He's in charge of training the herring for the herring circus. (His specialty is training herring to be shot out of cannons.) Ambrose remembers that there is actually a statistical test to see if a sample is unusual compared to one that comes from a normal distribution. He doesn't remember most of the details, except that it is called the _Shapiro-Wilks_ test and it is implemented in R as `shapiro.test()`. So Ambrose suggests the following procedure: 1. Perform a Shapiro-Wilks test and compute a p-value (the probability of seeing a larger value than the computed test statistic under the null hypothesis) 2. If the p-value is bigger than $0.05$, then perform a t-test and report the p-value. Otherwise, perform a Wilcoxon test and report the p-value. Ambrose (who is wrong) argues that this works because if we reject the null hypothesis at the 0.05 level for both the T- and Wilcoxon tests, we will still get the correct type one error. He says \begin{align*} \Pr(\text{Reject}) &= \Pr(\text{(Reject Shapiro and Reject Wilcoxon) OR (Don't Reject Shapiro and Reject T-test)}) \\ & = \Pr(\text{(Reject Shapiro and Reject Wilcoxon)}) + \Pr(\text{(Don't Reject Shapiro and Reject T-test)}) \\ &= \Pr(\text{Reject Shapiro})\Pr(\text{Reject Wilcoxon}) + \Pr(\text{Don't Reject Shapiro})\Pr(\text{Reject T-test}) \\ &= 0.05\Pr(\text{Reject Shapiro}) + 0.05(1-\Pr(\text{Reject Shapiro}))\\ &= 0.05 \end{align*} ### Question 3. What is Ambrose's mistake here? ```{r, echo = FALSE} # The mistake is that the Shapiro-Wilks test and the t-test/wilcoxon test # are run on the same data, rather than on independent data. # This means that the outcomes of these tests are not independent # which means that # Pr((Reject Shapiro) and (Reject Wilcoxon)) = # Pr(Reject Shapiro)Pr(Reject Wilcoxon | Reject Shapiro) # # This is NOT equal to Pr(Reject Shapiro)Pr(Reject Wilcoxon)! ``` ## Eunice likes Ambrose, but she doesn't trust him Trust between friends and colleagues is a wonderful and important thing. But when we are doing quantitative work, it never hurts to double check things computationally. So Eunice decides to fire up R and actually check if Ambrose's procedure has the correct type 1 error. Her null hypothesis is that the modifications to her teaching hasn't changed anything and the scores on each quiz is iid from some distribution. She dug up some code from her old statistics class that computed the type 1 error for some weird test as ```{r} type1_prob <- function(n, n_simulations = 10000){ experiments <- tibble( experiment = rep(1:n_simulations, each = n), draw = rnorm(n * n_simulations) ) tests <- experiments %>% group_by(experiment) %>% summarise(test = test_statistic(draw)) threshold <- 4.8 / sqrt(n) return(mean(abs(tests$test) > threshold)) } ``` This isn't quite what she wants. She does not know the form of the test statistic, and she does not know the threshold. But she does know that if she rejects the null hypothesis whenever the p-value is < 0.05, then the type1 error should be 0.05 (why?!). She knows that the R-interface for hypothesis tests are always the same. The output is a list and the `$p.value` element stores the p-value from the test. ```{r} x <- rnorm(100) test <- t.test(x) p_value <- test$p.value p_value ``` To warm up, Eunice decides to check the type one error for the Shapiro-Wilks test. ### Questions 4. Write a _function_ in R called `hypothesis_test` that accepts or rejects the Shapiro-Wilks hypothesis test at the 5% level. It should return `TRUE` if the null is rejected and `FALSE` if the null is not rejected. ```{r, echo = FALSE} hypothesis_test <- function(x) { test <- shapiro.test(x) return(test$p.value < 0.05) } ``` 5. Modify the function from Eunice's class notes so that it computes the type1 form 10000 simulations. Use this function to convince yourself that the Shapiro-Wilks test is going to give you the correct type 1 error rate. (Hint: Use the `hypothesis_test` function you just wrote) ```{r, eval = FALSE, echo = FALSE} type1_prob <- function(n, n_simulations = 10000){ experiments <- tibble( experiment = rep(1:n_simulations, each = n), draw = rnorm(n * n_simulations) ) tests <- experiments %>% group_by(experiment) %>% summarise(reject = hypothesis_test(draw)) return(mean(tests$reject)) } type1_prob(100) ``` ## Eunice checks Ambrose's work With that scaffolding in place, Eunice is ready to check if Ambrose's suggestion is any good. Although her full time work is focused on strapping ice skates to tiny kittens, she is pretty good at statistics and immediately sees a problem: she doesn't know what type of null distributions to check against. So she decides to check 3 different distributions that all have the same mean. 1. Both test scores come independently from $N(50,25^2)$ distributions 1. Both test scores come independently from $\text{Uniform}(10,90)$ distributions 1. Both test scores come independently from $\text{Exp}(1/50)$ distributions Her reasoning is that the first one is the best case for the tests, the second corresponds to students who are at all levels, and the third corresponds to more students getting low marks than high marks. ### Questions 6. Make a new function `hypothesis_test` that implements Ambrose's proposal. You are going to need the if/else syntax here, which allows R to run different code depending on whether or not something is true. The following code outputs whether a number is positive, negative or zero. In your code, you probably won't need the `else if` option, but it's there for completeness. ```{r} example <- function(x) { if (x < 0) { return("negative") } else if (x > 0) { return("positive") } else { ## this is what happens if none of the above hold return("zero") } } example(-3) example(pi) example(0) ``` ```{r, echo = FALSE} hypothesis_test <- function(x) { shapiro <- shapiro.test(x) is_normal <- (shapiro$p.value > 0.05) if(is_normal) { test <- t.test(x) } else { test <- wilcox.test(x) } return(test$p.value < 0.05) } ``` 7. Assess the type 1 error for Ambrose's procedure for each of Eunice's null distributions and class sizes of $n=10, 100, 400$. Some hints and suggestions: - Don't be a hero. The easiest way to do this is to write three functions (one for each population type). - Run the code! If you get an answer that doesn't seem realistic (like a type 1 error probability of close to 1), you've probably made a mistake. Check that you're doing the test on the _difference_ of two scores! - If you do want to be a hero / stretch yourself, try to make a function that lets you specify how the quiz marks are generated. ```{r, echo = FALSE, eval = FALSE} ## Normals. Note this one actually works! type1_prob <- function(n, n_simulations = 10000){ experiments <- tibble( experiment = rep(1:n_simulations, each = n), quiz1 = rnorm(n * n_simulations, mean = 50, sd = 25), quiz2 = rnorm(n * n_simulations, mean = 50, sd = 25), difference = quiz2 - quiz1 ) tests <- experiments %>% group_by(experiment) %>% summarise(reject = hypotheis_test(difference)) return(mean(tests$reject)) } tibble(n = c(10, 100, 400), map_dbl(n, type1_prob)) ``` ```{r, echo = FALSE, eval = FALSE} ## I'll give an example of some generic code. ## But it is totally OK to just do the above ## with different distributions for quiz1/quiz2. unif_quiz <- function(n) { return(runif(n, 10, 90)) } exp_quiz <- function(n) { return(rexp(n, 1 / 50)) } type1_prob <- function(n, rand_quiz, n_simulations = 10000){ experiments <- tibble( experiment = rep(1:n_simulations, each = n), quiz1 = rand_quiz(n * n_simulations), quiz2 = rand_quiz(n * n_simulations), difference = quiz2 - quiz1 ) tests <- experiments %>% group_by(experiment) %>% summarise(reject = hypothesis_test(difference)) return(mean(tests$reject)) } tibble(n = c(10, 100, 400), map_dbl(n, ~type1_prob(.x, unif_quiz))) tibble(n = c(10, 100, 400), map_dbl(n, ~type1_prob(.x, exp_quiz))) ``` ## Conclusion So Eunice's testing showed that Ambrose's procedure is OK in the scenarios that were tested. It's worth saying that this was not obvious---she was correct to check! There's a little bit more here that's worth saying. - This will not always happen! Ambrose was lucky. This happens for these alternatives basically because Shapiro-Wilk is a very good test. - We have only checked for three different shapes of the data. It might not work when the difference distribution has a heavy tail. How would you check this?

Week 4 Methods for computing bootstrap confidence intervals

Cognitive Behavioural Therapy: Does CBT have an effect?
Cognitive behavioural therapy (CBT) is a psychological treatment technique that aims to help a person change their thoughts (cognition) and their behavioural patterns. By learning to replace negative thoughts with more positive ones, and to correspondingly modify negative behaviours with the aim of improving feelings of anxiety and/or depression.
Although CBT has been an important treatments for anxiety and depression over many decades now, new methods for delivering the CBT treatment are regularly sought to try to improve the effectiveness of the general technique.
A recent study was undertaken to assess such a new CBT delivery method. In total, 60 people were recruited to voluntarily participate in the study, all participants having had a recent clinically confirmed episode of anxiety or depression, or both.
Each study participant was asked to complete a certain psychological assessment on two occasions, once before the new CBT delivery treatment method was applied, and once at the end of the treatment. As higher scores on the assessment are associated with an increase in anxiety and depression, it is hoped that the individual scores will be reduced following the new CBT treatment, compared with the corresponding scores obtained at the start of the study.
Each participant's scores on these two assessments are contained in a datafile named "CBT.csv", which is available now on the unit Moodle site. Each of the rows of the CBT.csv data file correspond to one of the sixty (60) subjects who participated in the CBT delivery method experiment. Column 1 corresponds to the individual participant's case number, while the values stored in columns 2 and 3 of each row (for the score1 and score2 measures) relate to the assessment scores for the participant corresponding to the row case number, with score1 the score of the assessment completed before the start of the CBT therapy and score2 the score of the assessment completed following the end of the CBT treatment.
In this example, we will compute a confidence interval for the difference in means before and after treatment.

Exercise 1 Plot the data

Plot the data! Construct a box-plot and visually compare the two distributions. Does it look like there is a difference?
Step 1
To make the data easier to plot in ggplot, we need to do one small manipulation. `ggplot` expects data in a specific format. When it goes to make a boxplot, _row_ in the data frame should contain the following aesthetics: - `x` --- the data being plotted (in this case the score) - `group` --- The group the data belongs to ("before" or "after") The problem we have is that our data doesnt' have the group stored explicitly! Instead, each row of the data contains and individual, score 1, and score 2. Thankfully, we can manipulate the data into the form that `ggplot` expects pretty easily^[If you've done ETC1010, you would've seen a different and much more general way of doing this using `pivot_longer` from the `tidyr` package. Because we only need to do this once, we are going to do it the manual way for the benefit of people who did not do 1010].
1. Sort the df for ploting ```{r, eval = FALSE, echo = TRUE} library(tidyverse) cbt <- read_csv("CBT.csv") plot_data <- tibble( test = rep(c("before", "after"), each = length(cbt$score1)), score = c(cbt$score1, cbt$score2)) ```
2. Plot the data ```{r, eval = FALSE, echo = FALSE} plot_data %>% ggplot(aes(x = score, y = test)) + geom_violin() ## a boxplot would be fine here too. ```

Exercise 2

Scenario :
Consider the variable named diff constructed from the difference in CBT assessment scores, with diff=score 2 - score. Use both the t-test and Bootstrap-based approaches to obtain a 95% confidence interval for each of the unknown population mean $\delta=\mu_2-\mu_1$. In addition, provide a plot of the approximate sampling distribution for the relevant point estimator, $\bar{X}$, with the point estimate and 95% confidence bounds indicated for each method. Comment on any apparent similarities or differences between the two intervals or the appropriateness of the methods for this application.

2.1. Take the CBT data and calculate the diff variable. Then display a plot of the sample histogram overlaid with a plot of the estimated density (use geom_density()). Make sure your histogram is on density scale!

```{r, echo = FALSE, eval = FALSE} cbt <- cbt %>% mutate(diff = score2 - score1) cbt %>% ggplot(aes(x = diff)) + geom_histogram(aes(y = after_stat(density))) + geom_density(colour = "blue", linetype = "dashed") ```

2.2 Compute the traditional 95% confidence interval for $\mu$ assuming it is (approximately) normal^[This is often called the "CLT confidence interval" because by the central limit theorem, the sampling distribution for $\mu$ is approximately Gaussian.]. You can get this from the output of t.test(). Save the value of the point estimate, in this case the sample mean of diff.

clt_ci <- t.test(cbt$diff)$conf.int

2.3. Generate $5000$ Bootstrap samples.

```{r eval = FALSE, echo = FALSE} bootstrap <- tibble(experiment = rep(1:5000, each = 60), index = sample(1:60, size = 60*5000, replace = TRUE), ystar = cbt$diff[index]) ```

2.4. Compute and report the 95% Bootstrap confidence interval for .

```{r eval = FALSE, echo = FALSE} y_bar <- mean(cbt$diff) bias <- bootstrap %>% group_by(experiment) %>% summarise(delta_star = y_bar - mean(ystar)) boot_ci <- y_bar + quantile(bias$delta_star, c(0.025, 0.975)) ```

2.5. Plot the density estimate (overlayed histogram and kernel density plot) corresponding to the empirical Bootstrap distribution (that is the distribution of y_bar + delta_star) and the position of the 2.5% and 97.5% quantiles of this distribution.

In addition,
indicate on the graph each of the following items:
  • the position of the sample mean shown by a solid vertical line,
  • the lower 2.5% and 97.5% quantiles of the empirical Bootstrap distribution as vertical dashed lines,
  • the lower and upper end points of the CLT-based confidence interval (from part i) as vertical dotted lines,
  • an appropriate title, and
  • appropriately modified axis labels.
code for 2.5
```{r, echo = FALSE, eval = FALSE} colours <- ggthemes::colorblind_pal()(4) bias %>% ggplot(aes(x = delta_star + y_bar)) + geom_histogram(aes(y = after_stat(density)), fill = colours[1], alpha = 0.6) + geom_density(colour = "blue", linetype = "dashed") + geom_vline(xintercept = y_bar, colour = colours[2]) + geom_vline(xintercept = boot_ci[1], linetype = "dashed", colour = colours[3]) + geom_vline(xintercept = boot_ci[2], linetype = "dashed", colour = colours[3]) + geom_vline(xintercept = clt_ci[1], linetype = "dotdash", colour = colours[4]) + geom_vline(xintercept = clt_ci[2], linetype = "dotdash", colour = colours[4]) + ggtitle("The estimated distribution of the difference") + xlab("Average difference between scores") ```

Exercise 4

(since ex3 is bonus, so we skipped it)

Remember the sample skewness from week 2
An approximation to the population skewness, which is defined as

Compute the bootstrap confidence interval for the population skewness for
  • A Gaussian distribution, which has skewness zero (regardless of the mean and variance parameters)
  • An Exponential distribution, which has skewness 2 (regardless of the rate parameter)
You should try to write a function that computes the coverage for the bootstrap confidence intervals. Ideally it should let you specify:
  • The size of the sample
  • The distribution the sample is drawn from
  • The the true value of the skewness
  • The number of bootstrap simulations

Question : For both distributions, compute the confidence intervals for and and comment on the coverage.

(Hint: Expect it to be ok for one of them and bad for the other)
```{r, eval = FALSE, echo = FALSE} ## Using code from class skewness <- function(y) { numerator <- mean((y - mean(y))^3) denom <- (mean((y - mean(y))^2))^(3/2) return(numerator / denom) } is_in_ci <- function(n, simulate = function(n) rexp(n,1), skew_true = 2, n_sim = 1000) { y <- simulate(n) bs <- tibble(experiment = rep(1:n_sim, each = n), ind = sample(1:n, n*n_sim, replace = TRUE), ystar = y[ind]) bias <- bs %>% group_by(experiment) %>% summarise(delta = skewness(y) - skewness(ystar)) int <- skewness(y) + quantile(bias$delta, c(0.025, 0.975)) return((skew_true > int[1]) & (skew_true < int[2])) } ## Let's just do this for n = 10 and the exponential. ## this takes a moment to run. It would be faster with ## fewer simulations is_in = map_dbl(1:1000, ~is_in_ci(10)) mean(is_in) ## I get about 0.56 - this is bad! ## It turns out that this measure of skewness can be quite biased when the distribution is skewed, which doesn't actually make it very useful! This combined with the small sample size makes the bootstrap fail. ## The gaussian case is a bit better, with 88% coverage for n=10 and ## 92% coverage for n = 100. But it's still short of the true coverage! ## It turns out that skewness is moderately difficult to compute! ```



Week 5 Parameters estimate using MLE

Lib needed
```{r, echo=TRUE, eval=TRUE} #part a library(tidyverse) library(broom) library(gridExtra) library(MASS) ```

Part A : parametric bootstrap

Steps are given to simulate data from a distribution, and subsequently obtain the MLE for and .

1. Simulate a random sample of size $n=36$ from a given distribution, $F$, here the distribution.

Call the vector of observations y, and create a tibble named dt that contains y and an observation ID number, from 1 to $n$, corresponding to the tibble row number.
```{r} ####Set the random generator seed set.seed(24601) n <- 36 mu.true <- 10 sig.true <- 2 y <- rnorm(n, mean = mu.true, sd = sig.true) dt <- tibble(id = 1:n, y = y) #head(dt) ```

2. Produce a plot that contains a histogram of y as well as a kernel density estimate of the probability density function (pdf) of the distribution that generated y based on the simulated data.

(You can change the colours or theme if you want to!)
```{r} dt %>% ggplot(aes(x = y)) + geom_histogram(aes(y = after_stat(density)), bins = 20, colour = "blue", fill = "blue", alpha = 0.2) + geom_density(colour="blue", alpha = 0.2) + ggtitle("Histogram of n = 36 draws simulated from N(10,4)") + xlab("simulated draws") + theme_classic() ```

3. Produce a plot that contains a histogram of y as well as a kernel density estimate of the probability density function (pdf) of the distribution that generated y based on the simulated data.

(You can change the colours or theme if you want to!)
```{r} dt %>% ggplot(aes(x = y)) + geom_histogram(aes(y = after_stat(density)), bins = 20, colour = "blue", fill = "blue", alpha = 0.2) + geom_density(colour="blue", alpha = 0.2) + ggtitle("Histogram of n = 36 draws simulated from N(10,4)") + xlab("simulated draws") + theme_classic() ```

4. Use the MASS::fitdistr() function to estimate the same $F$. Save the object produced and name it normal_fit. Once you have done this, run the rest of the code in the code chunk below. What does it show?

```{r} #install.packages("MAss") library(MASS) normal_fit <- fitdistr(y, "normal") normal_fit normal_fit %>% tidy() ```
we are pretending we don't know the "true population values" but want to fit a normal distribution to the data. This means estimating the values of the parameters, $\mu$ and $\sigma$. Of course we want to use the maximum likelihood estimation method, along with the sample data in y, to estimate the parameter values $\mu$ and $\sigma$ for a $N(\mu, \sigma^2)$ population. One way to do this in R is with the MASS::fitdistr(). Review the help file for the fitdistr() function to find out what it will do.^[You'll need to load the MASS package library before you can see the help for the fitdistr() function.[]
Explore the "F_fit" object produced in part d. (What kind of object is it?) Start by taking a look at what follows from each of the print statements below. What does each represent? What is the fitted population distribution?
```{r } normal_fit$estimate normal_fit$sd normal_fit$vcov normal_fit$n normal_fit$loglik ```

5. Calculate the fitted normal pdf at each observed y value in the sample,and save to your tibble as a variable named fit_normpdf. Overlay the estimated population density function at each y value in red dots on the previous plot. Does the image surprise you? (It may or it may not - it will depend on what you are expecting to see!)

```{r} ## save to simpler names - is clearer what the estimates represent mu_fit <- normal_fit$estimate[1] sig_fit <-normal_fit$estimate[2] dt %>% ggplot(aes(x = y)) + geom_histogram(aes(y = after_stat(density)), bins = 20, colour = "blue", fill = "blue", alpha = 0.2) + geom_density(colour="blue", alpha = 0.2) + stat_function(fun = dnorm, args = list(mean = mu_fit, sd = sig_fit)) + ggtitle("Histogram of n = 36 draws simulated from N(10,4)") + xlab("simulated draws") + theme_classic() ```

6. Create a vector containing the values of the theoretical cumulative distribution function (CDF) for $F_X(x_i)$, at each observation $X_i$ in $x$, using the estimated parameter values. Name the vector as fit_normCDF and add the variable created to your tibble dt.

```{r} dt <- dt %>% mutate(fit_normCDF = pnorm(y, mu_fit, sig_fit)) ```

7. Produce a plot by first arranging your tibble dt according to the order of the observed values x, and then using a line plot of the CDF_fit variable according to the arranged x values.

```{r} dt %>% arrange(y) %>% ggplot(aes(x = y, y = fit_normCDF)) + geom_point(col="red") + geom_line(colour="red") + ggtitle("Fitted Normal CDF") + theme_bw() + xlim(c(0,20)) + ylab("cumulative probability function (CDF)") + xlab("x") p1_normCDF ```

8. Add a plot of the empirical CDF function to your plot, by adding the stat_ecdf() geom.

```{r} dt %>% arrange(y) %>% ggplot(aes(x = y, y = fit_normCDF)) + geom_point(col="red") + geom_line(colour="red") + stat_ecdf() + ggtitle("Fitted Normal CDF") + theme_bw() + xlim(c(0,20)) + ylab("cumulative probability function (CDF)") + xlab("x") + labs(subtitle="Empirical CDF shown in black") p1_normCDF ```


Part B Use a non-parametric bootstrap to estimate confidence intervals for the estimated parameters.

1. Calculate the CLT-based 95% confidence intervals for each parameter in $F$, using the MLE for each parameter and its corresponding estimated standard error

[The standard error is the standard deviation of the estimate, considered as a function of the random data.].
We then need to manually construct the CLT-based confidence interval for each component of the parameter, $\theta=(\mu, \sigma)$. We do this by calculating vector of lower end points and the vector of upper end points for the pair of 95% confidence intervals.
```{r} set.seed(30127) mle_est <- normal_fit$estimate # point estimate mle_se <- normal_fit$sd # estimated SE clt_lower <- mle_est + qnorm(0.025) * mle_se clt_upper <- mle_est + qnorm(0.975) * mle_se ``` Given this information, state the 95% confidence intervals for $\mu$ and $\sigma$, respectively. ```{r} cbind(clt_lower, clt_upper) ```

2. Implement the Bootstrap sampling procedure, generating $B$ replicate datasets by sampling from y with replacement, and saving the MLEs associated with each replicated sample as the Bootstrap sample.

Things are a little different this time because we are bootstrapping two quantities at the same time! So we need to do a few things:
1. We summarise our groups using a tidid up `fitdistr`. This produces a tibble column called fit. 2. We then pull the things we need out of that column. 3. We then add the needed true values with a mutate. To do this we use `if_else` to make sure we add the _correct_ estimate (we add `mu_fit` when `term == "mean"` and `sig_fit` otherwise.) 4. And finally compute our biases
```{r} B <- 5000 experiments <- tibble(experiment = rep(1:B, each = n), index = sample(1:n, size = n * B, replace = TRUE), ystar = dt$y[index]) bias <- experiments %>% group_by(experiment) %>% summarise(fit = fitdistr(ystar, "normal") %>% tidy, term = fit$term, theta_star = fit$estimate) %>% mutate(theta_hat = if_else(term == "mean", mu_fit, sig_fit), delta = theta_hat - theta_star ) ```

3. Obtain the lower 2.5% and 97.5% quantiles from each component of the MLEs in the Bootstrap sample.

We do this by grouping and summarising again, but this time we group by `term`! (Note we have to use `unique(theta_hat)` so that it knows there's only one of them! ```{r} conf_ints <- bias %>% group_by(term) %>% summarise(lower = unique(theta_hat) + quantile(delta, 0.025), upper = unique(theta_hat) + quantile(delta, 0.975)) conf_ints ```

4. Obtain and display a plot of the Bootstrap sampling distribution, for each of the MLE components.

(You can add your own axis labels and titles...)
```{r} bias %>% ggplot(aes(x = theta_hat + delta)) + geom_histogram() + facet_wrap(~term, scales = "free_x") ```


Part D

(since part C is bonus q, we skipped it)

1. Generate a sample of size $n=544$ from a $Gamma(3.2, 1.7)$ distribution.

```{r echo=TRUE} set.seed(123) y <- data.frame(x=rgamma(n=544, 3.2, 1.7)) ```
The distribution appears to be unimodal, positive, and skewed
What parameters of the gamma distribution were used to simulate the sample?
($\alpha = 3.2, \beta = 1.7$)

2. If we are to use maximum likelihood distribution what values would we expect to get as the parameter estimates?

```{r} fitdistr(y$x, "gamma") %>% tidy() ```

3. Write a function to compute the log-likelihood function.

```{r} llik <- function(alpha, beta){ return(sum(dgamma(y$x, alpha, beta, log = TRUE))) } ```

4. Plot the likelihood function for a range of values of $\alpha, \beta$ that shows the maximum likelihood estimates for each parameter.

```{r} library(viridis) alpha <- seq(2,4, length.out = 100) beta <- seq(1,2, length.out = 100) grid <- expand.grid(alpha = alpha, beta = beta) grid <- grid %>% mutate(llik = llik(alpha, beta)) grid %>% ggplot(aes(x = alpha, y = beta, fill = llik)) + geom_tile() grid[which.max(grid$llik),] ```


Week 6 Survival Analysis : censored data


Part 1 : A quick check in with QQ-plots

Q1 Simulate samples n QQ-plots
a. Simulate samples of size $n=$ 30, 100, 500 from these distributions i. Lognormal(4, 2) ii. Gamma(3, 3) b. Make a QQ-plot of each these samples comparing to . Explain how closely the samples, of different sizes, appears to match the theoretical distribution. ```{r, message = FALSE} set.seed(30127) library(tidyverse) tibble(sample = rlnorm(30, 4, 2)) %>% ggplot(aes(sample = sample)) + geom_qq(distribution = qlnorm, dparams = c(4, 2)) + geom_abline(slope = 1, intercept = 0, linetype = "dashed") tibble(sample = rlnorm(100, 4, 2)) %>% ggplot(aes(sample = sample)) + geom_qq(distribution = qlnorm, dparams = c(4, 2)) + geom_abline(slope = 1, intercept = 0, linetype = "dashed") tibble(sample = rgamma(30, 3, 3)) %>% ggplot(aes(sample = sample)) + geom_qq(distribution = qgamma, dparams = c(3, 3)) + geom_abline(slope = 1, intercept = 0, linetype = "dashed") tibble(sample = rgamma(100, 3, 3)) %>% ggplot(aes(sample = sample)) + geom_qq(distribution = qgamma, dparams = c(3, 3)) + geom_abline(slope = 1, intercept = 0, linetype = "dashed") ```
The samples from the gamma distribution appear to match the true distribution very well even with relatively small samples. We see quite different behaviour with the log-normal due to its extremely heavy right tail. This means that essentially anything can happen over there and the median of the theoretical quantiles is not a good match for the observed extreme values. We can see why if we remember that a log-normal distribution is the exponential of a normal distribution. For large values, a small difference on the log-scale can still result in a very large difference on the exponentiated scale. This leads to the extreme behaviour we see here.
Q2 By drawing 100 samples of size 30 from the Gamma(3,3), visualise how variable the QQ-plot can be.
```{r} n <- 30 n_sim <- 100 experiments <- tibble( experiment = rep(1:n_sim, each = n), sample = rlnorm(n * n_sim, 4, 2) ) y <- rlnorm(n, 4, 2) experiments %>% ggplot(aes(sample = sample, group = experiment)) + geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "red") + geom_qq(distribution = qlnorm, dparams = c(4, 2), colour = "lightgrey", alpha = 0.5) + geom_qq(data = tibble(sample = y, experiment = 0), distribution = qlnorm, dparams = c(4, 2)) + scale_y_log10() + scale_x_log10() + xlim(0, 1e4) + ylim(0, 1e4) + theme_classic() ``` _We can see here the the log-normal samples are not surprisingly far from the line y = x._

Part 2 : Dealing with right-censored data in R

Right censored data is a common type of "time to event" data where the event is only observed for some of the observations. An example would be when the event is a death due to acute myelogenous leukemia. Some patients in the sample will die before the end of the study, while some (hopefully most!) will still have not died when the study ends.
In class, we talked about how important^[We talked about the statistical reasons for doing this, but when dealing with serious data about death, we are bound both morally and ethically to use our data as best we can.] it is to correctly account for both the events that we have observed and the ones that will occur, unobserved, in the future.
This data is so common that there is a packing in R that deals with it that you you don't even have to install. It's part of the basic R distribution. This package is, creatively, names survival^[Survival can deal with more complex censoring than just right censoring. It does a bunch of stuff that we are not going to look at in this class. It's a real industrial-strength thing.].
In order to deal with survival data, we need to keep track of two things:
  1. The observed time
  1. An indicator of whether this was an event or if we just stopped measuring (a censoring). You can check the help for other options, but this is usally coded as a 1 if it's an event and a 0 if it is not.
2.1 We can see this data structure in the aml dataset (the aforementioned leukemia) from the survival package.
```{r, echo = TRUE, eval = TRUE, message = FALSE} library(survival) library(tidyverse) head(aml) ``` This data contains an extra factor (`x`) that indicates what type of chemotherapy regime was used. We are going to use that to filter the data into subpopulations. One of the key functions in the survival package is `Surv`, which lets us put the time and event information together. ```{r, echo = TRUE, eval = TRUE} Surv(aml$time, aml$status) %>% head() ``` Notice that this separates the times into numbers (9, 13) and cenored times (13+), where the + indicates that the event occurs at some point after 13.
Let's break that down:
  • The Surv(time, status) term is there to link the time to event with the event indicator
  • The tilde (~) is a special character in R that we will see more of when we start moving into regression, but it exists to separate the data (the left of ~) from the "predictors" on the right. For today, we are just using ~ 1, which says "there are no predictors".
  • The data = aml tells the function which data to fit the survival curve on (and where to look for the columns time and status)
The fit object contains a bunch of stuff, but the easiest thing we can do is look at what it prints out as a default

2.1 This gives the number of observations, the number of observed events and an estimate with 95% confidence interval for the median survival time. We could've also gotten this median with the commnad

```{r, echo = TRUE, eval = TRUE} median(Surv(aml$time, aml$status)) ``` (The `50` is labelling the median as the 50% quantile. Is it terrible output? Yes.)

2.2 We can use the fit object to plot the Kaplan-Meier estimate of the survival curve. There are a few ways to do this, but we will use the survminer::ggsurvplot() function. To use this, you will need to install the survminer package.

```{r echo = TRUE, eval = TRUE} library(survminer) ggsurvplot(fit) ``` By default, this plots a confidence interval and you can turn this off with the option `conf.int = FALSE`.


Q1. You can plot the Kaplan-Meier curves for the sub-populations defined by the covariate x, by replacing ~ 1 with ~ x in the survfit call. Look at the resulting medians and Kaplan-Meier curves (use the same call to ggsurvplot).

```{r} fit2 <- survfit(Surv(time, status) ~ x, data = aml) fit2 ggsurvplot(fit2) ```
Write a few sentences commenting on the similarities and differences.
_The median survival time for the maintained events is larger than the non-maintained and the survival curves also show this ordering. However, there is very very little data here and we should be very reluctant to make statements without reflecting on the uncertaintyy here.
These are all the things that are computed when you call survfit.
  • time is the set of unique times in the data ($t_j$ in the lectures)
  • n.event is the number of events that were observed at each time ($d_j$ in the lectures)
  • n.censor is the number of censored observations at each time
  • n.risk is the number at risk at each time ($n_j$ in the lectures)
  • surv is the Kaplan-Meier estimate of the survival time.

Q2. We can check that we can compute surv from n.event and n.risk:

```{r, echo = TRUE, eval = TRUE} cumprod(1 - fit$n.event / fit$n.risk) - fit$surv ```
What does it mean that the above expression is (almost) zero
_It means that the output of the _cumprod expression is almost exactly the same as the survival curve, which suggests that the formula works!
Why isn't it exactly zero?
Computers are wonderful and magical things and one thing that happens is that computations with floating point numbers (numbers with decimals in them) are never perfectly exact. In fact, a computer can't tell the difference between numbers that are less than r .Machine$double.eps apart. So differences of the order of $10^{-17}$--$10^{-16}$ are extremely small and essentially zero.
What does cumprod do? (Also look up what cumsum does)
cumprod takes the cumulative product of a vector. It returns a vector of the same length as its imput and cumprod(x) would return c(x[1], x[1] * x[2], x[1] * x[2] * x[3], ...). cumsum does the same thing with sums instead of products.

Q3. How can you compute n.censor from this data?

```{r} n_censor <- aml %>% group_by(time) %>% summarise(n_censor = sum(status == 0)) %>% ungroup() %>% pull(n_censor) all(n_censor == fit$n.censor) ```

Q4. Computing the median

Finally, we can use the estimated survival curve to find the median. This is not unique except for in the very unusual situation where fit$surv is exactly 0.5 at one time. When there is no unique median, we have a number of options, two popular ones are:
  • Select the average of the two closest times with fit$surv on either side of 0.5
  • Select the time with fit$surv closest to 0.5.
For big enough data sets, both of these should give about the same value, but for small data sets the second option (which is what R uses) is slightly more reliable as it's possible that one of the two closest times has a survival estimate that is quite far from 0.5.
Let's look at this! ```{r, echo = TRUE, eval = TRUE} summary(fit) ``` The "closest" median is `27`, while the "average" median is `r (23+27)/2`. ### Questions - Write functions to find the median using both methods. They should take a `fit` as an input. ```{r} near_median <- function(fit){ if (length(fit$n) > 1) { stop("This only works for a single survival curve!") } index <- which.min(abs(fit$surv - 0.5)) return(fit$time[index]) } average_median <- function(fit) { if (length(fit$n) > 1) { stop("This only works for a single survival curve!") } suppressWarnings(lower_ind <- which.min(log(fit$surv - 0.5))) suppressWarnings(upper_ind <- which.min(log(0.5 - fit$surv))) return((fit$time[lower_ind] + fit$time[upper_ind])/2) } ```
There are a number of ways to do the second one. I went with a log transform that sends everything on the wrong side of the median to NaN _(which stands for Not a Number). This leads to some warnings so I wrapped it in a _surpressWarnings to get rid of them. A less lazy person would probably filter.

Q5. Consider the aml data when x == "Maintained". Compute the Kaplan-Meier curve for this subset of the data. Which of the two medians is most appropriate?

```{r} aml_maintained <- filter(aml, x == "Maintained") fit_maintained <- survfit(Surv(time, status) ~ x, data = aml_maintained) near_median(fit_maintained) average_median(fit_maintained) summary(fit_maintained) ``` _I would probably use the near median in this case as the upper number is quite a way away from 0.5._


Q6 Repeat the exercise for x == Nonmaintained.

```{r} aml_nonmaintained <- filter(aml, x == "Nonmaintained") fit_nonmaintained <- survfit(Surv(time, status) ~ x, data = aml_nonmaintained) near_median(fit_nonmaintained) average_median(fit_nonmaintained) summary(fit_maintained) ```

Q7 Look at the `diabetic` data (load it with `data(diabetic)`). Why is the median of this data not defined?

```{r} data(diabetic) fit_diabetic <- survfit(Surv(time, status) ~ 1, data = diabetic) tail(fit_diabetic$surv) ```
The survival estimate never gets to 0.5 due to the extreme amount of censoring in this data. This means that there is no way to define a sensible estimate of the median.

Fitting a parametric survival distribution

So far, we have only talked about non-parametric estimates of the survival curve. What if we think that the uncensored data came from a particular distribution? Not every distribution can be used to fit survival data in R, but three of the most useful options are
We fit these models using the survival::survreg function
```{r, echo = TRUE, eval = TRUE} fit2 <- survreg(Surv(time, status) ~ 1, data = aml, dist = "exponential") fit2 ```

What do the paraemters mean? The (Intercept) term is the log mean of the survival distribution (in this case the exponential). So the single unknown parameter in the exponential distribution can be computed as

```{r eval = TRUE, echo = TRUE} lambda <- exp(-coef(fit2)) ``` We can then look up what the median of an exponentially distributed random variable is.

Using this we can compare the parametric estimate of the median to the Kaplan-Meier estimate.

```{r, eval = TRUE, echo = TRUE} log(2) / lambda median(Surv(aml$time, aml$status))$quantile ```

We can also compare the survival curves.

```{r, echo = TRUE, eval = TRUE} km_curve <- ggsurvplot(fit, conf.int = FALSE) km_curve$plot + stat_function(fun = pexp, colour = "blue", line_type = "dashed", args = list(rate = lambda, lower.tail = FALSE) ) ```
The resulting survival curve is a prettty good match with the empirical estimate for early times, but after the median, the two estimates diverge wildly.

Fit the data using a Weibull distribution. And compare the median and the survival curve and comment on your findings.

The following code will be useful for translating between the survival package's parameterisation of the Weibull distribution and the standard R parameterisation.
```{r, echo = TRUE, eval = FALSE} ## fit3 is the weibull fit pweibull_shape <- 1 / fit3$scale pweibull_scale <- exp(coef(fit3)) ``` ```{r} fit3 <- survreg(Surv(time, status) ~ 1, data = aml, dist = "weibull") pweibull_shape <- 1 / fit3$scale pweibull_scale <- exp(coef(fit3)) pweibull_scale * log(2)^(1 / pweibull_shape) - median(Surv(aml$time, aml$status))$quantile km_curve$plot + stat_function(fun = pweibull, colour = "blue", line_type = "dashed", args = list(shape = pweibull_shape, scale = pweibull_scale, lower.tail = FALSE) ) ``` _This does slightly better (both with the median estimate and the shape of the survival curve), but I'm still not convinced by the right hand part of the survival curve. The wiebull distribution is not reflecting the almost linear decay of the Kaplan-Meier estimate of the survival curve._









ETC2420 - Tutorial solution (W1-3) CDF; Hypo Testing; Violin PlotETC2420 - Tutorial solution (W7-12) DAG ; Bayesian A/B Testing; Bayesian Regression; TidyModels

Table of contents
Jason Siu
A warm welcome! I am a tech enthusiast who loves sharing my passion for learning and self-discovery through my website.
Number of posts:
Table of contents