#' # HW2: Stats in R
#' 
#' Your name: _______
#' 
#' Instructions: Please complete this assignment on 
#' your own without using AI tools or outside solution 
#' websites. The goal is for you to build your own 
#' understanding. If you get stuck, that’s completely 
#' expected: come to office hours (Monday, Wednesday, 
#' and Friday at 12:30pm on Zoom) and I’ll help you 
#' work through it.
#' 
#' ## Random Variables
#' 
#' A **random variable** is any value that cannot be 
#' predicted exactly. For instance, these are all random
#' variables:
#' 
#' - the time it takes you to find your keys
#' - the message in a fortune cookie
#' - a dice roll
#' - the number of people who enter a store in an hour
#' - a stock return
#' 
#' In R, we can **simulate** a sample from a random 
#' variable using functions like `rnorm()` (generates 
#' random numbers from the normal distribution) and 
#' `runif()` (generates random numbers from the uniform 
#' distribution).
#' 
#' For example, suppose it takes you between 0 and 10
#' minutes to find your keys every morning. If that
#' value follows a uniform distribution (that is, any
#' number between 0 and 10 is equally likely), we can
#' generate a sample using `runif()`:

runif(n = 10, min = 0, max = 10)

#' This generates a **vector** of 10 numbers. Notice 
#' that when you run the line above multiple times, 
#' you'll get different numbers every time: they're random.
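#' 
#' As a quick check on what that vector contains, here
#' is a minimal sketch that stores one sample and
#' inspects it. The name `key_times` is made up for
#' this illustration (it isn't used elsewhere in the
#' assignment), and `<-` stores the sample under
#' that name:

# Store one sample of 10 draws (hypothetical name `key_times`)
key_times <- runif(n = 10, min = 0, max = 10)

length(key_times)  # how many numbers are in the vector
min(key_times)     # smallest draw; never below 0
max(key_times)     # largest draw; never above 10
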
#' 
#' ## Question 1: Generate a sample of random numbers 
#' from the normal distribution with mean 0 and standard
#' deviation 1. Run `?rnorm` to read the help docs
#' on the function `rnorm()`.

?rnorm

____

 
#' ## Estimators
#' 
#' Given a sample, we can **estimate** different 
#' properties of the random variable, like its 
#' expected value and variance. The standard 
#' estimator for a random variable's expected value 
#' is the sample mean (`mean()`), and the standard 
#' estimator for its variance is the sample 
#' variance (`var()`).
#' 
#' For example: let's generate a sample from the uniform
#' distribution, and then estimate the expected value
#' using the sample mean:

library(tidyverse)

runif(n = 10, min = 0, max = 10) %>%
  mean()

#' Run the code chunk above several times. You should
#' get numbers *around* 5, because the true expected 
#' value of a uniform random variable on 0 to 10 is 
#' (0 + 10) / 2 = 5.
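#' 
#' To see how much the estimates move around, here is
#' a rough sketch that repeats the estimate a few times
#' with base R's `replicate()` and also estimates the
#' variance with `var()`. The names `sample_means` and
#' `key_sample` are made up for this illustration. For
#' reference, the true variance of a uniform random
#' variable from 0 to 10 is (10 - 0)^2 / 12, or
#' about 8.33.

# Five separate samples of size 10, each reduced to its sample mean
sample_means <- replicate(5, mean(runif(n = 10, min = 0, max = 10)))
sample_means

# Sample variance of a single sample of size 10
key_sample <- runif(n = 10, min = 0, max = 10)
var(key_sample)

# True expected value and variance, for comparison
(0 + 10) / 2
(10 - 0)^2 / 12
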
#' 
#' ## Question 2: How large a sample do you need
#' for the estimate of the expected value to land
#' consistently within 0.01 of the true expected value?

runif(n = ___, min = 0, max = 10) %>%
  mean()

#' ## Question 3: Generate a small sample from the
#' normal distribution with mean 0 and standard 
#' deviation 1, then calculate its mean. How large
#' a sample do you need for the mean to be
#' within 0.01 of the true expected value?

____


#' ## Hypothesis Testing with lm: p-values
#' 
#' In Econometrics, you learned about using OLS
#' (ordinary least squares) to estimate a linear model.
#' Let's simulate this in R by:
#' 
#' - Creating a variable `education`: let's assume
#'   years of education are uniformly distributed
#'   between 0 and 16. Read the assignment operator
#'   `<-` as "gets": that is, the name "education"
#'   gets the vector created by
#'   `runif(n = 20, min = 0, max = 16)`.
#'   After you run the line below, `education`
#'   becomes a variable in your Global Environment,
#'   which refers to a vector of 20 random numbers.

education <- runif(n = 20, min = 0, max = 16)

#' - Next we'll create another variable `u` which
#'   represents idiosyncratic shocks (random noise).

u <- rnorm(n = 20, mean = 0, sd = 50)

#' - Next, let `earnings` be `40 + 10 * education + u`:
#'   that is, earnings is a linear function of education
#'   with a y-intercept of 40 and a slope of 10.

earnings <- 40 + 10 * education + u

#' - Finally, we can use `lm()` to estimate the linear
#'   model of `earnings` as the dependent variable with 
#'   `education` as the explanatory variable. 
#'   `broom::tidy()` tidies the regression output and
#'   lets us see p-values to help us assess the 
#'   statistical significance of the regression 
#'   coefficients:

lm(earnings ~ education) %>%
  broom::tidy()
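
#' Because `u` is random, the estimates change every
#' time you re-run the simulation. As a rough sketch
#' of that variability, the hypothetical helper
#' `simulate_slope()` below (not part of the
#' assignment) repeats the steps above with its own
#' local variables and returns just the estimated
#' slope:

# Simulate one data set and return the estimated slope on education.
# All names inside the function are local to this illustration.
simulate_slope <- function(n = 20) {
  educ_sim <- runif(n = n, min = 0, max = 16)
  noise_sim <- rnorm(n = n, mean = 0, sd = 50)
  earn_sim <- 40 + 10 * educ_sim + noise_sim
  coef(lm(earn_sim ~ educ_sim))[["educ_sim"]]
}

# Five independent simulations; the true slope is 10 in every one
slopes <- replicate(5, simulate_slope())
slopes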

#' ## Question 4: Interpret the regression output
#' first by reading the "estimate" column. The true
#' value for the y-intercept was 40; you estimated
#' the y-intercept to be ____. The true value for
#' the slope (effect of a one-year increase in 
#' education on earnings) was 10; you estimated it to
#' be ____.
#' 
#' The next three columns in the `broom::tidy()` 
#' output (`std.error`, `statistic`, and `p.value`)
#' all have to do with the statistical 
#' significance of those y-intercept and slope 
#' estimates. In particular, the `p.value` column 
#' gives you the probability that random noise alone 
#' could have produced an estimate at least as far 
#' from zero as the one you observed, if the true 
#' coefficient were actually zero. So if you get a 
#' large p-value, it's saying that an estimate like 
#' yours could easily have come from noise alone, 
#' even with no true relationship: we say the estimate is
#' **not statistically significant**. But if you get
#' a small p-value, that says the relationship you
#' estimated is not likely to have been produced by
#' noise alone: the data are evidence that the true 
#' coefficient is not zero.
#' If the p-value is less than .05, we say the estimate
#' **is statistically significant at the 5% level**.
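#' 
#' To make "noise alone" concrete, here is a rough
#' sketch that simulates many data sets in which the
#' true slope really is zero and records the slope's
#' p-value each time. The helper `null_p_value()` and
#' the other names are made up for this illustration.
#' With normally distributed noise, we'd expect roughly
#' 5% of these p-values to fall below .05 even though
#' no relationship exists.

# Simulate one data set with a true slope of zero and
# return the p-value on the slope coefficient.
null_p_value <- function(n = 20) {
  x_null <- runif(n = n, min = 0, max = 16)
  y_null <- 40 + rnorm(n = n, mean = 0, sd = 50)  # y does not depend on x
  fit <- lm(y_null ~ x_null)
  summary(fit)$coefficients["x_null", "Pr(>|t|)"]
}

# Repeat 1000 times and compute the share of "significant" slopes
null_p_values <- replicate(1000, null_p_value())
mean(null_p_values < 0.05)
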
#'
#' ## Question 5: In your `broom::tidy()` output,
#' is the y-intercept statistically significant?
#' What about the slope?
#'
#'
#'
#' Last step: compile this document to HTML and upload
#' the HTML file to Canvas.
#'