Homework Assignment #3

Question #1

Using the built-in iris dataset:

  1. Write a for() loop or use lapply() to test whether each numeric column in the iris dataset is normally distributed using the shapiro.test() function, and print the p-value from each test. You must write a for() loop or use lapply() to get credit for this question.
  2. Make a histogram of each iris numeric variable.
  3. Create a highly skewed variable by raising the values of iris$Sepal.Length to the 10th power (calculated like this: iris$Sepal.Length^10).
  4. Make a histogram of this highly skewed variable.
  5. Calculate the mean and median of this highly skewed variable.
  6. Explain in words WHY these mean and median values are so different.
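
As a starting point, the looping pattern in parts 1 and 5 might look something like the sketch below. It is one possible approach, not a required solution. The names numeric_cols, p_values, and skewed are illustrative choices that do not come from the assignment.

```r
# Keep only the numeric columns of the built-in iris data frame.
numeric_cols <- iris[sapply(iris, is.numeric)]

# Run shapiro.test() on each column and keep only the p-value.
p_values <- lapply(numeric_cols, function(x) shapiro.test(x)$p.value)
print(p_values)

# Build the skewed variable described in part 3 and summarize it.
skewed <- iris$Sepal.Length^10
print(mean(skewed))
print(median(skewed))
```

The same result can be produced with a for() loop over names(numeric_cols); lapply() is used here only because it returns a named list that prints cleanly.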

Question #2

The zipped folder genetic_sequences.zip contains 100 .csv files, each of which contains fake genetic sequence data. Each file represents a partial genetic sequence for an individual person. Unzip this folder, then do the following:

  1. On average, what proportion of each person’s genetic sample is Guanine (G)? (Hint 1: first think about how you would do this for one of the .csv files, and write the code to do so. Then write a for() loop or use lapply() to apply your code to all 100 files. Hint 2: you can use list.files("path/to/folder", full.names=TRUE) to list the complete file path for every file in a folder. This will be helpful when looping over all 100 files!)
  2. Make a histogram of the proportion of Guanine (G) in each person’s genetic sample.
  3. Describe the shape of the distribution in part 2. Is this distribution normal (i.e., Gaussian, or bell-shaped)? Justify your answer.
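
The "one file first, then all files" approach from the hints can be sketched roughly as follows. Because the layout of the real .csv files is not described here, this sketch builds a tiny stand-in folder and assumes each file has a single column named base with one letter per row. The folder demo_sequences, the column name base, and the file names are all hypothetical and will need adjusting to match the actual data.

```r
# Build a tiny stand-in folder so the sketch runs without genetic_sequences.zip.
# ASSUMPTION: one column called "base", one nucleotide letter per row.
demo_dir <- file.path(tempdir(), "demo_sequences")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(base = c("G", "A", "T", "G")),
          file.path(demo_dir, "person1.csv"), row.names = FALSE)
write.csv(data.frame(base = c("C", "G", "T", "A")),
          file.path(demo_dir, "person2.csv"), row.names = FALSE)

# List the full path of every .csv file in the folder.
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# For each file, compute the share of rows equal to "G".
prop_g <- sapply(files, function(f) {
  seq_data <- read.csv(f, stringsAsFactors = FALSE)
  mean(seq_data$base == "G")
})

# Average proportion across files, then a histogram of the per-file values.
print(mean(prop_g))
hist(prop_g, main = "Proportion of G per file", xlab = "Proportion of G")
```

With the real data, demo_dir would be replaced by the path to the unzipped folder, and the column reference would change to whatever the files actually contain.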