Quantitative Data Analysis
Assessment:
To assess students' understanding of the quantitative data analysis methods that underpin data science.
Learning Outcome 1: Demonstrate a practical understanding of core quantitative data analysis methods and data visualisation in data science application and research. (K, S)
Learning Outcome 2: Demonstrate skills in implementing these methods on real heterogeneous data using a software package and in critically evaluating and interpreting the results (S).
Learning Outcome 3: Critically reflect on data visualisation and the ability of various methods and techniques to effectively present value and insight. Evaluate the strengths and weaknesses of quantitative analysis methods alongside an understanding of how and when to use or combine methods. (C)
Data for this assignment have been randomised for each student. Each dataset is in the form data[num].csv where [num] is your student number. For example, if your student number is 12345678, the file will be called data12345678.csv.
Question 1. Petrol is manufactured using two test manufacturing processes, and it is desired to test whether the processes produce petrol with different specific energy. Data is provided in the file petrol[num].csv, which consists of a list of 20 samples taken at random from each process, and the specific energy (in MJ/kg) measured in the sample.
(a) Produce a boxplot to show the difference in specific energy between the two processes. Comment on your boxplot.
(b) Perform a formal hypothesis test at the 5% level of significance for the hypothesis that the mean specific energy level differs for the two processes. State clearly your hypotheses, why you chose your particular test, and your conclusions.
(c) If instead we wish to test whether the mean specific energy for process 1 is higher than that for process 2, how would our conclusions for part (b) change?
(d) Find a 95% confidence interval for the difference in the mean specific energy levels for the two processes. Interpret carefully what this confidence interval means in this context.
(e) Does the confidence interval in part (d) support your hypothesis test in part (b)? Justify your answer.
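The link between parts (b), (d) and (e) can be sketched in a few lines of Python. This is a minimal illustration only, assuming Welch's unequal-variance t-test; whether that test or a pooled-variance test is appropriate is part of what you must justify. The function names welch_t and confidence_interval are hypothetical, the six values per process are made up and are not taken from any petrol[num].csv file, and the critical value 2.228 is the tabulated two-sided 5% point of the t distribution on 10 degrees of freedom, which matches this toy example only.

```python
import math
import statistics


def welch_t(x, y):
    """Welch two-sample t statistic and approximate degrees of freedom."""
    n1, n2 = len(x), len(y)
    diff = statistics.fmean(x) - statistics.fmean(y)
    # Estimated variance of each sample mean.
    v1 = statistics.variance(x) / n1
    v2 = statistics.variance(y) / n2
    se = math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return {"diff": diff, "se": se, "t": diff / se, "df": df}


def confidence_interval(diff, se, t_crit):
    """Two-sided interval for the difference in means, given a critical value."""
    return diff - t_crit * se, diff + t_crit * se


# Made-up specific energies (MJ/kg), for illustration only.
process_1 = [45.1, 44.8, 45.6, 45.3, 44.9, 45.4]
process_2 = [44.2, 44.9, 44.5, 44.1, 44.7, 44.4]

result = welch_t(process_1, process_2)

# Assumption: critical value read from t tables for about 10 degrees of freedom.
T_CRIT = 2.228
lower, upper = confidence_interval(result["diff"], result["se"], T_CRIT)

reject_two_sided = abs(result["t"]) > T_CRIT
interval_excludes_zero = lower > 0 or upper < 0

print(f"t = {result['t']:.3f} on about {result['df']:.1f} degrees of freedom")
print(f"95% CI for the difference in means: ({lower:.3f}, {upper:.3f})")
print("Reject H0 at the 5% level:", reject_two_sided)
```

With the same critical value used for both, the two-sided test rejects exactly when the interval excludes zero, which is the relationship part (e) asks you to discuss. For your own data the degrees of freedom, and therefore the critical value, will differ.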
Question 2. An analyst for a cafeteria chain wishes to investigate the relationship between the number of self-service coffee dispensers in a cafeteria and sales of coffee. Fourteen cafeterias that are similar in their volume of business, type of clientele and location are chosen for the experiment. The number of dispensers varies from zero (coffee is only dispensed by serving staff) to six and is assigned randomly to each cafeteria. Sales are measured in hundreds of gallons of coffee sold. The results are in the file sales.csv.
(a) Make an appropriate plot of the data, and fit a simple linear regression model to these data. Write down your model and your fitted values.
(b) What are the assumptions made for the simple linear regression model? Produce residual plots for this model, and hence or otherwise check the assumptions of the model.
(c) Suggest a possible improvement to the model based on what you have found and explain why you think it will be better.
(d) Fit your improved model and check the model assumptions. Use your model to predict the average volume of sales for a new cafe which opens with 5 dispensers.
(e) You are asked to write a brief summary of your findings for the marketing manager of the company. In non-statistical language, briefly interpret your findings.
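As a rough illustration of the fitting step in part (a) and the residuals needed for part (b), the least-squares line can be computed directly from its defining formulas. This is a sketch only: the function name fit_line is hypothetical, and the eight data points are made up and do not come from sales.csv. It covers the simple linear model only and does not suggest which improvement to choose in part (c).

```python
import statistics


def fit_line(x, y):
    """Least-squares intercept and slope for the model y = b0 + b1 * x."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up data, for illustration only (not the contents of sales.csv).
dispensers = [0, 0, 1, 1, 2, 2, 3, 3]
sales = [5.1, 4.9, 5.9, 6.2, 7.0, 6.8, 7.4, 7.6]

intercept, slope = fit_line(dispensers, sales)
fitted = [intercept + slope * d for d in dispensers]
residuals = [s - f for s, f in zip(sales, fitted)]

print(f"fitted model: sales = {intercept:.3f} + {slope:.3f} * dispensers")
for d, r in zip(dispensers, residuals):
    print(f"dispensers = {d}, residual = {r:+.3f}")
```

Plotting the residuals against the fitted values or against the number of dispensers is the usual next step: a systematic pattern such as curvature would suggest that the straight-line form is inadequate, while a random scatter about zero is consistent with the model.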
Question 3. A dataset on red wine quality is included as part of a dataset for modelling wine quality based on physicochemical tests.
It is proposed that the quality of wine can be determined based on the following variables: residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol.
All the wines are given a subjective mark of quality by the highly respected Uxbridge University Wine Board. It is of interest to the Wine Board to determine which factors are likely to be associated with a better quality wine.
(a) Produce appropriate plots and output to produce some qualitative and visual summary of the dataset, demonstrating the relationship between the explanatory variables and the quality. You should include:
• appropriate graphs labelled correctly;
• summary output and statistics;
• brief comments on how this output shows how the chemical variables relate to each other and the quality variable.
(b) Using any of the methods you have met in the course for regression, find a suitable model to predict the quality of the wine using the other variables. You should include:
• appropriate method(s) for selecting the best multiple regression model;
• appropriate checks and plots to show that the regression model fits the data correctly, and how you deal with any violations of the model assumptions;
• comments on how you identify and deal with any unusual observations;
• models involving transformations of the response (e.g. logarithms) where appropriate.
(c) Write a short report for the Chairman of the Wine Board, who has no statistical knowledge, indicating which factors are most important for determining a quality wine. You should include reference to your final model.
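One simple piece of summary output for part (a) is the correlation of each explanatory variable with quality. The sketch below is an illustration only: the function name pearson_r is hypothetical, only three of the eight variables are shown, and the six rows of values are made up rather than taken from the wine dataset. A correlation describes each variable on its own and is not a substitute for the multiple regression model in part (b).

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up values, for illustration only (not rows of the wine dataset).
wines = {
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    "chlorides": [0.076, 0.098, 0.092, 0.075, 0.069, 0.060],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.35, 3.30],
}
quality = [5, 5, 6, 6, 7, 7]

correlations = {name: pearson_r(values, quality) for name, values in wines.items()}

# Order the variables by the strength of their linear association with quality.
ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)

for name, r in ranked:
    print(f"{name:10s} r = {r:+.3f}")
```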
Attachment:- Quantitative Data Analysis.rar