Question #1
Using the built-in iris dataset:
- write a for() loop or use lapply() to test whether each numeric column in the iris dataset is normally distributed using the shapiro.test() function, and print out the p-value from this test. You must write a for() loop or use lapply() to get credit for this question.
- make a histogram of each iris numeric variable
- create a highly skewed variable by raising the values of iris$Sepal.Length to the 10th power (calculated like this: iris$Sepal.Length^10)
- make a histogram of this highly skewed variable
- calculate the mean and median of this highly skewed variable
- explain in words WHY these mean and median values are so different.
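
As a rough illustration of the looping pattern this question asks for, the sketch below applies shapiro.test() to each numeric column with lapply() and then builds the skewed variable. The object names (numeric_cols, p_values, skewed) are placeholders chosen for this example, not names required by the assignment, and this is only one of several reasonable ways to structure the code.

```r
# Keep only the numeric columns of the built-in iris data frame
# (this drops the Species factor column).
numeric_cols <- iris[sapply(iris, is.numeric)]

# Run shapiro.test() on each numeric column and extract the p-value.
p_values <- lapply(numeric_cols, function(x) shapiro.test(x)$p.value)
print(p_values)

# Draw one histogram per numeric column.
for (col_name in names(numeric_cols)) {
  hist(numeric_cols[[col_name]], main = col_name, xlab = col_name)
}

# Build the highly skewed variable and summarise it.
skewed <- iris$Sepal.Length^10
hist(skewed, main = "Sepal.Length^10", xlab = "Sepal.Length^10")
print(mean(skewed))
print(median(skewed))
```

A for() loop over names(numeric_cols) that calls shapiro.test() inside the loop body would satisfy the same requirement; lapply() is used here only because it returns the p-values as a named list in one step.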
Question #2
The zipped folder genetic_sequences.zip contains 100 .csv files, each of which contains fake genetic sequence data. Each file represents a partial genetic sequence for an individual person. Unzip this folder, then do the following:
- On average, what is the proportion of Guanine (G) represented in each person’s genetic sample? (Hint: first think about how you would do this for 1 of the csv files, and write the code to do so. Then write a for() loop or use lapply() to apply your code to all 100 files. Hint 2: you can use this code list.files("path/to/folder", full.names=TRUE) to list the complete file path for all files in a folder. This will be helpful when trying to loop over all 100 files!)
- make a histogram of the proportion of Guanine (G) represented in each person’s genetic sample.
- Describe the shape of the distribution in the histogram from part 2b. Is this distribution normal (i.e. Gaussian, a bell curve)? Justify your answer.
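
The "one file first, then all files" approach from the hint might look something like the sketch below. The layout of the real files in genetic_sequences.zip is not described here, so this example writes its own small stand-in files to a temporary folder and assumes each file has a single column named base with one nucleotide letter per row. The folder name demo_sequences, the column name base, and the helper prop_g() are all hypothetical and would need to be adapted to the actual data.

```r
# ASSUMPTION: stand-in data only. Each demo file has one column named "base"
# holding one nucleotide letter per row; the real files may be laid out differently.
set.seed(1)
demo_dir <- file.path(tempdir(), "demo_sequences")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:5) {
  bases <- sample(c("A", "C", "G", "T"), size = 200, replace = TRUE)
  write.csv(data.frame(base = bases),
            file.path(demo_dir, paste0("person_", i, ".csv")),
            row.names = FALSE)
}

# Step 1: code that works for a single file (hypothetical helper).
prop_g <- function(path) {
  # Read the column as character so a letter like "T" is never parsed as TRUE.
  seq_data <- read.csv(path, colClasses = "character")
  mean(seq_data$base == "G")
}

# Step 2: list the full path of every csv file and apply the helper to each.
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
g_props <- unlist(lapply(files, prop_g))

# Average proportion of G across files, and its distribution.
print(mean(g_props))
hist(g_props, main = "Proportion of G per file", xlab = "Proportion of G")
```

With the real data, demo_dir would be replaced by the path to the unzipped folder, and the body of prop_g() would change to match however the sequences are actually stored in each file.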