StatiticsAssignment.docx

Statistics Assignment

The purpose of this assignment is to give you an opportunity to demonstrate your skills in describing and analysing data using concepts and tools that we have developed in the course so far.

Below are instructions on how to collect a specified set of data and what to do with it.  Your goal is to produce a report in MS Word discussing the data and submit this along with a single MS Excel workbook showing your workings.   A suggested target range for the word count of the report is 700-1000 words.

I have prepared and attached an example Excel workbook which I will refer to below. Note: my Excel workbook is not a model answer.  You may choose to use different visualisations and do not necessarily need all the computed statisitcs and charts I have included.  It really depends on the features of the data you have, so you need to use your own judgement as to how to best present and describe the data.  Besides, the primary output for this assignment is the report itself, not the workbook.

Data collection

Collect quantitative data on two variables from the Sustainable Development Report 2021 website.

· Go to  and browse around the site to become familiar with its purpose and the information publicly available there.

· Go to the "Downloads" page and click on "Database EXCEL" to download the database of indicators used to assess countries' progress towards the UN Sustainable Development Goals.  You will be taking data from the "SDR2021 Data" sheet in the workbook. Starting from column AR of that sheet, there are columns of cross-country data for the SDG indicators, one row for each country.  Note that from row 195 the data are for regional blocs so these should be excluded from the data you take.

· You have been assigned two variables according to the last digit of your Student ID number.  You can find the variables assigned to you in the attached file "Assigned variables.xlsx".  For example, my student ID number (a long long time ago in a galaxy not so far away) ended with 2 so I would be using variables "Poverty headcount ratio at \$1.90/day (%)" and "Cereal yield (tonnes per hectare of harvested land)".  I have chosen pairs of variables that may potentially have a statistical relationship.  If you wish you are welcome to switch one of the variables with another one from the database that you are interested in investigating and you think is related to the variable you retain.

· P.S: My student number ends with 7. Therefore, my topics are

 7 Corruption Perception Index (worst 0-100 best) 16 Population with access to clean fuels and technology for cooking (%) 7

· Look up your variables in the Data Explorer on the website or in the report from page 75 (some newer variables are not included on the website yet, it seems).  The main thing you want to understand is what a given value of each of your variables means.  E.g. I found that the "Poverty headcount ratio at \$1.90/day (%)" is the estimated percentage of the population that is living under the poverty threshold of US\$1.90 a day.

· Each indicator has a number of associated columns in the Database workbook, the first column of the set has the data, the others can be safely ignored.  So for the indicators I use in my example Excel workbook, I took data from columns QZ and GO (and only down to row 194).  Using Excel's Find tool is a quick way to find your data.  Copy and paste the data you will use into your workbook.  You should keep the country names alongside the data so that you can identify which observation is for which country.

· Construct separate univariate data sets for analysis.  There will probably be many countries where there is no estimate for the indicators you are looking at.  If there is no observation recorded, do not assume the observed value is zero.  In general, missing observations in data rarely mean they should be replaced with zeros.  Also consider if it is appropriate to include observations that are recorded as zero.  In my example Excel workbook I have retained observations of zero for CO2 emissions because this suggests those countries are not exporting fossil fuels, while blank cells mean there is no observation.  It is fine to have blank cells within your data ranges, Excel will usually ignore them (as long as they are truly blank).

· Construct a bivariate (paired) data set - i.e. for each country you should have an observation for both variables.  You can see in my example Excel workbook how I use some Excel formulas and the Replace tool to blank out cells for countries where there is only an observation for one of the two variables.  If you find that the number of countries that you have left in the bivariate data set is low, say less than 30, it might be best to go back to the Database and replace the variable that is causing many countries to be dropped.

· In your report you should note any difficulties with the data preparation and implications of dropping countries from the data sets if such was required.

An educated guess

Guess the average value for each variable.

· Run your eye down the column of univariate data you have for each variable (the separate data not the paired data), and make a guess what you think the cross-country average would be for each one.  Do not use Excel to calculate the averages here.

· Just take a note of your guesses; you will use them later.

Data description

Use numerical summary measures and graphical representations to describe the two variables (using the separate data)

· You can use the "Descriptive Statistics" tool in the data analysis tool pack and also calculate quartiles, coefficients of variation etc.

· Draw a histogram, boxplot, etc. for each data set.

· You should discuss the important and interesting features of the data revealed by your descriptive statistics and graphical representations in your report.  In my example workbook you will see the CO2 data is strongly positively skewed, so much so that the boxplot is almost meaningless.  Two options I had was to drop some of the largest observations, or to transform the data.  I chose the latter - by taking the log of the data I end up with a data set distribution that can be usefully presented on a boxplot or histogram.  Outliers and skewness are common features of cross-country data like this, so you should be prepared to drop observations or transform data if necessary, and explain why you did this in your report.  (Just because data is skewed doesn't mean you have to transform it!  You'll notice I did not transform the literacy data.)

Use numerical summary measures and graphical representations to consider if there might be a relationship between the two variables (using the paired data).

· Use the correlation coefficient and a scatterplot to see the strength and direction of the relationship (if any) between the two variables.

· In your report, discuss the above and explain why you think the relationship might be causative, spurious, or driven by a third factor.

Data analysis

Construct confidence intervals (using the separate data).

· Now assume that the data for each variable is a random sample and construct a confidence interval for the population mean of each variable.  Since you don't know the population standard deviations you should use critical values from the Student t-distribution.

· State your confidence interval in your report, explaining what it means (to a layperson) and also discuss if you have any doubts about the validity of the interval.

Compute p-values (using the separate data).

· Now assume that your "educated guess" of the average for each variable is the true mean of that variable.  How likely is it that you would observe the sample mean you have obtained, or something more extreme, if your parameter assumption for each variable is correct?  I.e. find the two-tail p-value associated with each sample mean.You can obtain the p-value by doing a two-tail hypothesis for the mean, for each data set.

· State the p-values in your report and explain their meaning.  Conclude by stating whether your educated guesses were probably right or wrong.  (There is no penalty if your educated guesses are wrong!)

Report

As noted above, your assignment output should consist of a report and a spreadsheet workbook.  Imagine that the reader of your report is a busy executive with only a basic understanding of statistics.  Your report should therefore be of professional appearance and be able to be fully understood without reference to the workbook.  I.e. paste relevant charts into the report; do not paste the full descriptive statistics table into the report but rather use an abridged table and/or discussion; do not show the computation of the confidence intervals and p-values in the report but do state and interpret them.

Remember the suggested word count is 700-1000 words but this is a guide only: if you accomplish everything required above with less, that is fine; ideally don't go much over 1000 - this would indicate you are not being concise enough.

Finally, I have attached some collated feedback I provided to students last year.  You may like to refer to this to see what I am hoping to see in your report.