suppressMessages(library(tidyverse))

Part I - Monte Carlo

Problem 1a

Compute the following integral using Monte Carlo integration (provide estimate and estimated standard error):

\[ \int_0^2 \frac{\cos(x(2-x))}{1+x+\sqrt{x}}dx\]

xmin <- 0
xmax <- 2

ymin <- 0
ymax <- 1

integrand <- function(x){
  
  top <- cos(x*(2-x))
  bottom <- 1 + x + sqrt(x)
  return(top / bottom)
  
}

S <- 100000

xs <- runif(S, min = xmin, max = xmax)
ys <- runif(S, min = ymin, max = ymax)
under_curve <- (ys <= integrand(xs))

x <- seq(xmin, xmax, length.out = 1000)
y <- integrand(x)


ggplot() + 
  geom_point(aes(x=xs[under_curve], y=ys[under_curve])) +
  geom_line(aes(x=x, y=y), col="red", linewidth=2) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax), color = "blue", alpha=0, linewidth = 2)

fraction_area <- sum(under_curve)/S
fraction_area
## [1] 0.29505
  • The estimated ratio of the area under the red curve to the area of the blue rectangle is approximately 0.295.

  • The blue rectangle spans \([0,2] \times [0,1]\), so its area is 2. The estimated area under the red curve is therefore:

curve_area <- 2*fraction_area
curve_area
## [1] 0.5901
  • This area is our Monte Carlo estimate of the integral.

  • To quantify its uncertainty, let’s repeat the estimate many times and compute the mean and standard error of the resulting sequence:

# Hit-or-miss Monte Carlo: fraction of uniformly drawn points under the curve
# (relies on the global bounds xmin, xmax, ymin, ymax)
MC_fraction <- function(integrand, S){
  
  xs <- runif(S, min = xmin, max = xmax)
  ys <- runif(S, min = ymin, max = ymax)
  under_curve <- (ys <= integrand(xs))
  
  return(sum(under_curve)/S)
  
}


S <- 10000
N <- 10000
areas_seq <- numeric(N)

for (val in 1:N){
  areas_seq[val] <- 2*MC_fraction(integrand, S)
}

mc_estimate <- mean(areas_seq)
mc_se <- sd(areas_seq) / sqrt(length(areas_seq))

paste("The estimate is:", mc_estimate)
## [1] "The estimate is: 0.59076656"
paste("The error is:", mc_se)
## [1] "The error is: 9.09432704738753e-05"
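As a cross-check, the same integral can be estimated without the hit-or-miss construction, using the sample-mean Monte Carlo estimator \((b-a)\,\overline{f(X)}\) for \(X \sim \text{Unif}(a,b)\). A minimal sketch, reusing integrand and the bounds above (the values will vary from run to run; xs_direct and fx are names introduced here for illustration):

# Sample-mean estimator: (xmax - xmin) * mean(f(X)) for X ~ Unif(xmin, xmax)
xs_direct <- runif(100000, min = xmin, max = xmax)
fx <- integrand(xs_direct)
(xmax - xmin) * mean(fx)                    # point estimate
(xmax - xmin) * sd(fx) / sqrt(length(fx))   # estimated standard error

For a bounded integrand like this one, the sample-mean estimator typically has lower variance than hit-or-miss with the same number of draws, since it uses the value of the integrand rather than only an indicator.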

Problem 1b

Compare your estimate to quadrature integration using integrate()

integrate(integrand, xmin, xmax)
## 0.5907916 with absolute error < 7.9e-05
  • MC integration: 0.5908

  • Quadrature integration: 0.5908

  • The two methods agree to four decimal places; the difference is about \(2.5 \times 10^{-5}\), well within the Monte Carlo standard error.

  • The reported errors of the two methods are on the same order of magnitude, approximately \(10^{-4}\).
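The agreement can be checked directly (reusing mc_estimate from above; quad_result is a name introduced here):

# Absolute difference between the Monte Carlo and quadrature estimates
quad_result <- integrate(integrand, xmin, xmax)
abs(mc_estimate - quad_result$value)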

Problem 2

Estimate the power of the Student’s \(t\)-test for difference in means for two normally distributed populations.

The Student’s \(t\)-test is a standard test for a difference in means between two normal populations. In this exercise, we will estimate the power of the \(t\)-test using Monte Carlo estimators.
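For reference, R’s t.test defaults to Welch’s version of the statistic, which does not assume equal variances (appropriate here, since sd1 and sd2 differ):

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} \]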

Assume two normal populations whose means differ by true_diff: the first sample has n1 = 30 observations with mean = mean1 and sd = sd1, and the second has n2 = 20 observations with sd = sd2 and mean = mean1 + true_diff.

Problem 2a

Write a function that returns a data frame of 50 (= n1 + n2) rows, where the first column, named smpl, contains the normal draws from both populations, and the second column, named pop, indicates which population each observation came from (taking values 1 or 2). The inputs to this function should be n1, mean1, sd1, n2, sd2 and true_diff.

n1 <- 30; mean1 <- 0; sd1 <- 1
n2 <- 20; sd2 <- 1.5
true_diff <- 0.5

return_df <- function(n1, mean1, sd1, n2, sd2, true_diff){
  
  # Initialize empty data frame
  df <- data.frame(matrix(ncol = 2, nrow = n1 + n2))
  colnames(df) <- c("smpl", "pop")
  
  # Draw from normal distributions
  sample1 <- rnorm(n1, mean1, sd1)
  sample2 <- rnorm(n2, mean1 + true_diff, sd2)
  
  # Assign the normal draws to the data frame
  df$smpl[1:n1] <- sample1
  df$smpl[(1+n1):(n1+n2)] <- sample2
  
  # Assign the population indicator to the data frame
  df$pop[1:n1] <- rep(1, n1)
  df$pop[(1+n1):(n1+n2)] <- rep(2, n2)
  
  # Return the data frame
  return(df)
}

df <- return_df(n1, mean1, sd1, n2, sd2, true_diff)
head(df, 6)
##         smpl pop
## 1  1.6468052   1
## 2 -1.8520940   1
## 3  2.2816039   1
## 4  0.8263027   1
## 5  0.8471339   1
## 6  0.7220557   1
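As an aside, the same data frame can be built more compactly. A behavior-equivalent sketch (return_df2 is a hypothetical name, not used elsewhere):

# Equivalent construction: stack the draws, then label the populations
return_df2 <- function(n1, mean1, sd1, n2, sd2, true_diff){
  data.frame(
    smpl = c(rnorm(n1, mean1, sd1), rnorm(n2, mean1 + true_diff, sd2)),
    pop  = rep(c(1, 2), times = c(n1, n2))
  )
}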

Problem 2b

The hypotheses we want to test in this simulation study are:

\[H_0: \text{mean}_1 - \text{mean}_2 = 0\] \[H_a: \text{mean}_1 - \text{mean}_2 \neq 0\]

at level \(\alpha\) (let us fix alpha = 0.05 for the rest of the exercise)

We can use a standard t-test for that:

alpha <- 0.05
t_test_result <- 
  t.test(
    x = df$smpl[df$pop == 1],
    y = df$smpl[df$pop == 2],
    conf.level = 1 - alpha
  )

The decision rule is to reject if:

t_test_result$p.value < alpha
## [1] FALSE

Write a function that performs this \(t\)-test for an input data frame defined in the same way as ours (a smpl column for the sample and a pop column for the population, taking values 1 or 2). A second input, alpha, should be the level of the test.

The function’s output should be a logical value - whether or not the null is rejected.

run_t_test <- function(df, alpha){
  
  t_test_result <- 
  t.test(
    x = df$smpl[df$pop == 1],
    y = df$smpl[df$pop == 2],
    conf.level = 1 - alpha
  )
  return(t_test_result$p.value < alpha)
}

run_t_test(df, alpha)
## [1] FALSE
  • The decision rule is to reject the null hypothesis in favor of the alternative if the p-value is less than alpha.

  • Here the function returns FALSE, meaning the p-value is at least alpha, so we fail to reject the null hypothesis in favor of the alternative.
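For reference, the quantities behind this decision can be read off the saved test object:

t_test_result$p.value   # the p-value compared against alpha
t_test_result$conf.int  # the (1 - alpha) confidence interval for the difference in means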

Problem 2c

Write a function that estimates the power of the \(t\)-test for these parameters and this difference in means. That is, the function should take as inputs S, the number of tests to perform (which in this case is the number of Monte Carlo samples), and the parameters n1, mean1, sd1, n2, sd2, true_diff, and alpha.

The function will generate S different data frames simulated from the same set of parameters (S = 500 is a reasonable size, but you may increase it for better accuracy). It will test each of those using the t-test and measure the rejection rate for the test. This rate is an estimate of the power for this set of parameters. The function will return a vector so that each of its elements is a power estimate.

power_estimate <- function(S, N, n1, mean1, sd1, n2, sd2, true_diff, alpha){
  
  powers <- numeric(N)
  
  for (i in 1:N){
    
    # Rejection of the null is recorded as TRUE
    rejections <- replicate(S, {
      df <- return_df(n1, mean1, sd1, n2, sd2, true_diff)
      run_t_test(df, alpha)
    })
    
    # The rejection rate estimates the power
    powers[i] <- mean(rejections)
  }
  
  return(powers)
}

S <- 2000
N <- 50

powers <- power_estimate(S, N, n1, mean1, sd1, n2, sd2, true_diff, alpha)

mean(powers)
## [1] 0.24596
sd(powers)/sqrt(N)
## [1] 0.001282125
  • The mean rejection rate (the estimated power) is approximately 0.246, so the test detects a true difference of 0.5 only about a quarter of the time.

Higher power means the test is better able to detect a difference between the population means when one actually exists. However, it is also important to verify that the test does not reject the null more often than it should (the type I error rate should be at most alpha). If it does, rejections in the other cases cannot be considered reliable.

powers_typeI <- power_estimate(S, N, n1, mean1, sd1, n2, sd2, 0, alpha)

mean(powers_typeI)
## [1] 0.0503
sd(powers_typeI)/sqrt(N)
## [1] 0.0006711245
  • With true_diff = 0, the population means are equal and the null hypothesis is true. The rejection rate therefore estimates the type I error rate, which for a level-alpha test should be at most alpha (not zero: a level-0.05 test rejects a true null about 5% of the time).

  • The estimated type I error rate is approximately \(0.0503 \pm 0.0007\), consistent with alpha = 0.05 to within Monte Carlo error.

  • Increasing S and N tightens this estimate, but it does not change the test’s actual type I error rate; see the sanity check below.
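A quick sanity check, assuming the \(S \times N\) simulated tests are independent and the true rejection probability equals alpha: the rejection-rate estimate is then a binomial proportion, so its Monte Carlo standard error should be about

# Theoretical SE of a binomial proportion from S*N independent tests,
# assuming the true rejection probability is alpha
sqrt(alpha * (1 - alpha) / (S * N))

which comes to roughly \(6.9 \times 10^{-4}\), in line with the observed standard error.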

Problem 2d

Power Study: we will now extend the above to a full-blown “power study”. That is, we check the power of the test for different levels of true difference in means (take the following sequence, seq(-2, 2, 0.2), as the domain of true-difference values). For each level of true difference in means, record the estimated power and conclude by plotting the estimated power vs. the true difference in means.

S <- 1000
N <- 20
diff_means <- seq(-2, 2, 0.2)

mean_powers <- numeric(length(diff_means))
errors <- numeric(length(diff_means))

for (i in seq_along(diff_means)){
  
  powers_temp <- power_estimate(S, N, n1, mean1, sd1, n2, sd2, diff_means[i], alpha)

  mean_powers[i] <- mean(powers_temp)
  errors[i] <- sd(powers_temp)/sqrt(N)
}

The results match what we expect:

  1. As the absolute value of the true difference in means grows, a good test will have more power (rejection rate closer to 1).

  2. Since the \(t\)-test and the normal distributions are symmetric, we expect the power curve to be symmetric around 0.

  3. Where the difference in means is 0, the rejection rate is the type I error, which should be at or below alpha.

ggplot() + 
  geom_point(aes(x=diff_means, y=mean_powers)) + 
  geom_errorbar(aes(x=diff_means, ymin = mean_powers - errors, ymax = mean_powers + errors), width = 0.2) +
  geom_line(aes(x=diff_means, y=mean_powers))

print(min(mean_powers))
## [1] 0.0516

Part II - Bootstrap

Estimate standard error of robust psi regression estimators.

In lecture 8 we developed a “robust” regression based on fitting the psi function. In the lecture, we fitted the parameters using our gradient descent method. In the interest of time, fit the parameters here using optim.

library(numDeriv)
x <- seq(-5,5,0.01)

# Huber-style loss: quadratic for |x| <= c, growing linearly for |x| > c
psi <- function(x, c) {
  return (
    ifelse(
      abs(x) > c,
      2 * c * abs(x) - c^2, 
      x^2
    )
  )
}
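The loss is quadratic for \(|x| \le c\) and grows only linearly beyond, which is what makes the fit robust to large residuals. A quick look at its shape, reusing the grid x defined above:

# psi penalizes small residuals quadratically and large ones only linearly
ggplot() + geom_line(aes(x = x, y = psi(x, c = 1.2)))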

# Sanity check: minimizing psi itself over x recovers its minimum at x = 0
fit1 <- optimize(psi, c(-5,5), c = 1.2)
fit1
## $minimum
## [1] 4.440892e-16
## 
## $objective
## [1] 1.972152e-31

The optimization procedure provides us with point estimates, but no measure of uncertainty for them: neither standard errors nor confidence intervals with which to determine whether they differ from 0 with any statistical certainty. The bootstrap method allows us to obtain both.

Use the same data and regression model as before: the data are MASS::cats and the model has Hwt (heart weight) as response and Bwt (body weight) as the single predictor.

Problem 1a

Estimate the coefficients of a simple linear regression where the fitting is based on minimizing the mean of the psi-error.

cats <- MASS::cats
c <- 1 # threshold for the psi loss (used as a global below)

mean_psi_regression <- function(data, par) {
  
  # Mean psi loss of the regression residuals (uses the global threshold c)
  X <- data$Bwt
  Y <- data$Hwt
  return( mean( psi(Y - (par[1] + par[2]*X), c) ) )
  
}

# Compute initial guesses from the OLS fit
ols_coef <- coef(lm(Hwt ~ Bwt, data = cats))
b0_i <- ols_coef["(Intercept)"]
b1_i <- ols_coef["Bwt"]

result <- optim(par = c(b0_i, b1_i), fn = mean_psi_regression, data = cats)

b0 <- result$par["(Intercept)"]
b1 <- result$par["Bwt"]

cat(paste("b0 = ", b0, ", b1 = ", b1, ", c = ", c, sep=""))
## b0 = -0.142182314168739, b1 = 3.91029029507439, c = 1
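For comparison, the OLS coefficients for the same model (used above as starting values) are close but not identical; the psi fit down-weights observations with large residuals:

# Ordinary least-squares coefficients for the same model
coef(lm(Hwt ~ Bwt, data = cats))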

Problem 1b

Construct a bootstrap sample for the estimated coefficients of a simple linear regression where the fitting is based on minimizing the mean of the psi-error.

n <- nrow(cats)
M <- 10000 # number of draws

# sample for estimated regression coefficients
est_coeffs_bstp_sample <- data.frame(b0 = numeric(M), b1 = numeric(M))

for (m in 1:M){
  
  # sample indices with replacement
  bootstrap_sample_idx <- sample(n, replace = TRUE)
  
  # bootstrap sample
  bstp <- cats[bootstrap_sample_idx, ]
  
  coeffs <- optim(par = c(b0_i, b1_i), fn = mean_psi_regression, data = bstp)
  
  est_coeffs_bstp_sample[m,1] <- coeffs$par["(Intercept)"]
  est_coeffs_bstp_sample[m,2] <- coeffs$par["Bwt"]
  
}

head(est_coeffs_bstp_sample, 6)
##           b0       b1
## 1  0.9535093 3.541712
## 2 -0.3523900 4.008991
## 3 -0.1231361 3.987214
## 4 -1.3032893 4.302527
## 5 -0.3777716 3.943214
## 6 -0.6306157 4.057321

Problem 1c

Estimate the standard deviations of each of the coefficients

sd_b0 <- sd(est_coeffs_bstp_sample$b0)
sd_b1 <- sd(est_coeffs_bstp_sample$b1)

cat(paste("sd_b0 = ", sd_b0, ", sd_b1 = ", sd_b1, sep = ""))
## sd_b0 = 0.826385041305366, sd_b1 = 0.314806899831392
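As a point of reference, the classical standard errors from the OLS fit are on the same scale (the two estimators need not agree exactly):

# Classical OLS standard errors for the same model
summary(lm(Hwt ~ Bwt, data = cats))$coefficients[, "Std. Error"]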

Problem 1d

Construct 95% confidence intervals for the coefficients and, based on those, determine whether or not we can reject a null hypothesis that each of them is equal to zero.

alpha <- 0.05

# CI for b0
(bootstrap_CI <- quantile(est_coeffs_bstp_sample$b0, c(alpha/2, 1 - alpha/2)))
##      2.5%     97.5% 
## -1.800947  1.482706
# CI for b1
(bootstrap_CI <- quantile(est_coeffs_bstp_sample$b1, c(alpha/2, 1 - alpha/2)))
##     2.5%    97.5% 
## 3.306057 4.559379
  • Because zero is in the confidence interval for b0, we fail to reject the null that b0 = 0 at 95% confidence.

  • Because zero is not in the confidence interval for b1, we reject the null that b1 = 0 at 95% confidence.
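As a cross-check on the percentile intervals, normal-approximation intervals built from the point estimates and the bootstrap standard deviations lead to the same conclusions:

# Normal-approximation 95% CIs: estimate +/- z * bootstrap SD
z <- qnorm(1 - alpha/2)
c(b0 - z * sd_b0, b0 + z * sd_b0)  # contains zero -> fail to reject b0 = 0
c(b1 - z * sd_b1, b1 + z * sd_b1)  # excludes zero -> reject b1 = 0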