1) Describe what is being measured and the level of measurement for the following variables: P344pr, A093r, SexHRP, A094r, and G018r.

Answer:

The Living Costs and Food Survey (LCF) collects information on household budgets in the UK, drawing on data about expenditure and the cost of living. It is conducted across the UK on an annual basis, making it the most substantial source of household expenditure information. The dataset contains several variables, which can be either qualitative or quantitative in nature. The variables P344pr, A093r, SexHRP, A094r, and G018r are part of the LCF2013 dataset.

The variable P344pr represents the gross normal weekly household income; it is continuous and, because it has a true zero, it is measured at the ratio level. A093r represents the economic position of the household reference person and is a categorical nominal variable. Its categories are economically inactive, unemployed and work-related government training programmes, part-time working, and full-time working. SexHRP represents the gender of the household reference person and is a dichotomous nominal variable (male or female). A094r represents the occupational class of the reference person, following the National Statistics Socio-economic Classification. Its categories are higher managerial, administrative and professional occupations; intermediate occupations; never worked and long-term unemployed, students and occupation not stated; not classified for other reasons; and routine and manual occupations. It is treated here as a categorical nominal variable.

G018r is the number of adults in the household, that is, the count of all persons aged 18 years and above in the household; it is therefore a discrete variable measured at the ratio level.

2) Using the appropriate measures, report and interpret the central tendency and dispersion for P550tpr, P425r, A121r, and G019r. You should report your output in a table or plots.

Answer:

Measures of central tendency and dispersion are important for exploring the data of interest. The measures of central tendency include the mean, median and mode, while the measures of variability or dispersion include the variance, standard deviation and range. In addition, descriptive plots or charts can be used to represent the data. The variables to be summarized are P550tpr, P425r, A121r, and G019r, which measure total weekly household expenditure, main source of household income, household tenure and number of children in the household respectively. Table 1 below reports the measures of central tendency and dispersion for total weekly household expenditure. The mean total weekly household expenditure is 479.76 (SD = 292.37) and the median is 419.90. The mean exceeds the median, which suggests a right-skewed distribution, and expenditure ranges from 30.52 to 1175.00.

Table 1: Descriptive statistics for the total weekly expenditure

Total weekly expenditure summary

Statistic                  Value
Mean                       479.76
Standard deviation (SD)    292.37
Median                     419.90
Min                        30.52
Max                        1175.00
Range                      1144.48

The other three variables are categorical and are explored using frequency charts such as bar and pie charts. In particular, home tenure and main source of household income are displayed using pie charts, while the number of children in the household is represented using a bar chart.

Figure 1: Pie chart for home tenure

The pie chart for home tenure shows three types of tenure: owned, public rented, and private rented. The proportion of participants who own their home is the highest, followed by those in public rented homes, while those in private rentals make up the smallest proportion.

Figure 2: Pie chart for main source of household income

The preceding pie chart shows the distribution of households by main source of household income. The larger of the two portions consists of households whose main source is earned income.

Figure 3: Bar chart for number of children in a household

The above bar chart shows the distribution of households by the number of children in the household. There are three categories: households with no children, one child, and two or more children. The bar chart indicates that a substantial number of households have no children, and that households with two or more children outnumber those with only one child.

3) Graphically display the distribution of P344pr by Gorx. How does P344pr vary between and within Gorx? Interpret your results.

Answer:

This question explores the distribution of gross normal weekly household income when grouped by government office region. A side-by-side boxplot was constructed to explore the relationship between the two variables graphically. The figure below displays the boxplots of gross normal weekly income grouped by the government office region where the household is located. Note that the region labels missing from the axis are, from the right, North West and Merseyside, Scotland, and Yorkshire and the Humber. The boxplots indicate that the East Midlands, North East, North West and Merseyside, South West, and West Midlands have relatively small within-region variation compared with the remaining regions. London has the highest within-region variation, while the North East appears to have the least. Between regions, London and the South East have the highest gross normal weekly income, while the North East has the lowest.

Figure 4: side-by-side boxplot for gross normal weekly income by government office region

PART B: Inferential Statistics: Confidence intervals, chi square and t-tests

1) Calculate and interpret a 95% confidence interval for the sample mean of P550tpr. Explain your working.

Answer:

The 95% confidence interval for the mean total weekly household expenditure is obtained from the sample mean and the margin of error. The margin of error is the product of the two-sided critical t-value and the standard error of the sample mean, where the critical value is taken at the 95% confidence level with n-1 degrees of freedom. From the R output: sample mean = 479.76, n = 5144, df = 5143, and margin of error = 7.991. Subtracting the margin of error from the mean gives the lower limit, and adding it gives the upper limit. Therefore, the 95% CI for total weekly household expenditure is (471.77, 487.75). In conclusion, we can be 95% confident that the true mean total weekly household expenditure lies between 471.77 and 487.75, on the assumption that the sampling distribution of the mean is approximately normal, which is reasonable for a sample of this size.
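The working above can be retraced from the reported summary statistics alone. The following is a minimal sketch in base R that assumes the mean, standard deviation and sample size quoted in this report; the variable names are illustrative and do not come from the original script.

```r
# Summary statistics as reported in this answer and in Table 1
xbar <- 479.76   # sample mean of P550tpr
s    <- 292.37   # sample standard deviation
n    <- 5144     # number of households

se     <- s / sqrt(n)              # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # two-sided critical value at 95% confidence
me     <- t_crit * se              # margin of error

ci_mean <- c(lower = xbar - me, upper = xbar + me)
print(round(me, 3))
print(round(ci_mean, 2))
```

Working from the summary statistics rather than the raw data makes it easy to check that the margin of error and the interval limits agree with each other.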

2) Calculate and interpret a 99% confidence interval for the sample proportion working full time (A093r) of those in employment or looking for work (i.e. A093r!= "Economically inactive"). Explain your working.

Answer:

The first step in calculating the 99% confidence interval for the sample proportion of economically inactive respondents (A093r) is to calculate the point estimate, which is the sample proportion itself. The second step is to obtain the margin of error for the proportion, which is the product of the two-sided z-value at the 99% confidence level and the standard error of the proportion. The point estimate is the frequency of economically inactive persons divided by the total number of responses. The calculated point estimate is 0.3896, that is, 38.96% of the respondents in the 2013 LCF survey were economically inactive at the time of the study. The standard error of the proportion is approximately 0.0068, and the corresponding margin of error at the 99% confidence level is 0.0175. The lower and upper limits of the proportion of economically inactive respondents are therefore 0.3721 and 0.4071 at the 99% level of confidence.
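The steps above can be sketched as a small helper in base R. The first call reproduces the interval for the economically inactive proportion from the values reported here. The second part illustrates how the proportion working full time among the economically active, which is what the question requests, could be obtained with the same helper; the `status` vector is a made-up stand-in for LCF2013$A093r and its category labels are assumptions, so no result for the real data is implied.

```r
# Normal-approximation (Wald) confidence interval for a proportion
prop_ci <- function(p_hat, n, conf = 0.99) {
  z  <- qnorm(1 - (1 - conf) / 2)        # two-sided critical z-value
  se <- sqrt(p_hat * (1 - p_hat) / n)    # standard error of the proportion
  c(estimate = p_hat, se = se,
    lower = p_hat - z * se, upper = p_hat + z * se)
}

# Values reported above: share of economically inactive respondents
inactive_ci <- prop_ci(0.3896, 5144)
print(round(inactive_ci, 4))

# Hypothetical toy data (NOT the LCF sample), for illustration only
status <- c("Full-time working", "Part-time working", "Economically inactive",
            "Full-time working", "Unemployed", "Full-time working")
active <- status[status != "Economically inactive"]  # keep the economically active
p_full <- mean(active == "Full-time working")        # full-time share among them
fulltime_ci <- prop_ci(p_full, length(active))
```

With the real data, `status` would be replaced by the A093r column, and the denominator would become the number of economically active respondents rather than the full sample.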

3) Create and report a cross tabulation between G018r and A121r for those living in the North West and Merseyside region. Describe any patterns observed in the table and determine if there is a statistically significant association.

Answer:

Statement of hypothesis:

Null hypothesis: home tenure and number of adults in the household are independent variables

Alternative hypothesis: Home tenure and number of adults in the household are significantly associated.

The first step in performing the cross tabulation analysis is to create a subset of the original dataset. In this case, the subset function creates a new data frame that keeps only the ID number and the two variables of interest, restricted to responses collected from the North West and Merseyside region. The table function then produces a two-way contingency table for the two categorical variables, that is, household tenure and number of adults in the household. Finally, the chi-square test function is used to determine whether household tenure and number of adults in the household are independent. The cross tabulation and chi-square test results from R are provided below:

Table 2: Crosstab for home tenure and adults in the household

                           Home tenure
Adults in the household    Owned    Private rented    Public rented
1 adult                    97       46                60
2 adults                   234      41                45
3 adults                   33       4                 6
4 and more adults          14       0                 5

Table 2 shows that the largest single group is owned homes with two adults in the household. Interestingly, none of the private rented homes has a household with four or more adults. Overall, owned homes have the highest frequency for every household size, followed by public rented homes, with private rented homes last. For the test of independence, the chi-square statistic = 42.085, DF = 6, p-value < 0.05. Since the p-value < 0.05, the null hypothesis is rejected, and we conclude that there is a statistically significant association between home tenure and number of adults in the household in the North West and Merseyside region.
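The test can be re-run directly from the counts in Table 2 without access to the raw data. The sketch below rebuilds the table as a matrix in base R; the object names are illustrative. Two expected cell counts fall below 5, so `chisq.test` issues a warning that the chi-square approximation may be inexact, which is suppressed here and is worth bearing in mind when reading the p-value.

```r
# Counts copied from Table 2 (rows: adults in household, columns: tenure)
tab <- matrix(c( 97, 46, 60,
                234, 41, 45,
                 33,  4,  6,
                 14,  0,  5),
              nrow = 4, byrow = TRUE,
              dimnames = list(
                adults = c("1 adult", "2 adults", "3 adults", "4 and more adults"),
                tenure = c("Owned", "Private rented", "Public rented")))

# Pearson chi-square test of independence
test <- suppressWarnings(chisq.test(tab))
print(test)

# Expected counts under independence, to check the small-cell issue
print(round(test$expected, 1))

# Row proportions, useful for describing patterns in the table
print(round(prop.table(tab, margin = 1), 3))
```

The row proportions make the pattern easier to describe than the raw counts, for example the share of two-adult households that own their home compared with one-adult households.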

4) Report the strength of the association using the appropriate test. Interpret what this can tell you about the relationship between G018r and A121r.

Answer:

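The strength of the association between household tenure and number of adults in the household was measured with Cohen's w, which gives an effect size of 0.268. By Cohen's conventional benchmarks (0.1 small, 0.3 medium, 0.5 large), this is just below a medium effect, so the relationship between the two variables is moderate: tenure and household size are related, but knowing one gives only limited information about the other.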
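The reported effect size can be recovered from the chi-square statistic and the subset size. The sketch below uses base R only and also computes Cramér's V, a related measure that adjusts for the table dimensions; Cramér's V is not reported in the original analysis and is shown purely for comparison.

```r
chi2   <- 42.085   # chi-square statistic reported for Table 2
n      <- 585      # households in the North West and Merseyside subset
n_rows <- 4        # categories of adults in the household
n_cols <- 3        # categories of home tenure

# Cohen's w: square root of chi-square divided by the sample size
w <- sqrt(chi2 / n)

# Cramér's V: rescales by the smaller table dimension minus one
v <- sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))

print(round(c(cohens_w = w, cramers_v = v), 3))
```

Cramér's V is bounded between 0 and 1 for any table size, which can make it easier to compare across tables with different numbers of categories.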

5) Report and interpret the mean gross normal weekly household income (P344pr) for those who work (full-time or part-time) in higher managerial, administrative and professional occupations versus those who work (fulltime or part-time) in lower social class occupations (A094r) in the full sample (i.e. living in all regions in the UK).

Answer:

Table 3 below presents the tabulated R output for both the summary statistics and the completed t-test. The average gross weekly household income for households where the reference person works in higher managerial, administrative and professional occupations is 940.80 (SD = 271.21), while for those working in lower class occupations (routine and manual occupations) it is 611.62 (SD = 289.82).

Tabulated R output:

Table 3: Welch two-sample t-test for gross normal weekly household income

Statistic                Higher class occupations    Lower class occupations
Mean                     £940.80                     £611.62
SD                       £271.21                     £289.82
n                        1457                        867

t-stat                   27.117
df                       1725.1
p-value                  < 0.001

95% CI for difference
Lower limit              305.37
Upper limit              352.99

6) Is there a statistically significant difference in mean gross normal weekly household income (P344pr) between those who work in higher managerial, administrative and professional occupations versus those who work in lower social class occupations (A094r)? State a null hypothesis and alternative hypothesis. Explain why you chose this test and whether the data meet the assumptions to conduct the test.

Answer:

Statement of hypothesis:

Null hypothesis: The mean difference of gross normal weekly household income between those who work (full-time or part-time) in higher managerial, administrative and professional occupations and those in lower social class occupations is zero.

Alternative hypothesis: The mean difference of gross normal weekly household income between those who work (full-time or part-time) in higher managerial, administrative and professional occupations and those in lower social class occupations is not zero.

A Welch two-sample t-test was chosen because it compares the means of two independent groups without assuming equal variances. The data meet the main t-test criteria because the dependent variable is continuous with a ratio level of measurement and the independent variable is a dichotomous nominal variable. The two groups are independent of each other and the observations are also independent. However, the Shapiro-Wilk test for normality gives W = 0.90217, p-value < 0.01, which implies that gross normal weekly household income does not meet the normality assumption. With group sizes of 1457 and 867, the t-test is reasonably robust to this departure, since the sampling distribution of the mean difference is approximately normal in large samples.

The results from Table 3 above indicate t = 27.117, DF = 1725.1, p-value < 0.05, which leads to the rejection of the null hypothesis that the mean difference between the two occupation levels is zero. In conclusion, there is a statistically significant difference in mean gross normal weekly household income between households whose reference person works (full-time or part-time) in higher managerial, administrative and professional occupations and those whose reference person works in lower class occupations.
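The Welch statistic, its degrees of freedom and the confidence interval can all be retraced from the group means, standard deviations and sizes in Table 3. The function below is a minimal base R sketch; its name and arguments are illustrative and it is not part of the original script.

```r
# Welch two-sample t-test computed from summary statistics
welch_from_summary <- function(m1, s1, n1, m2, s2, n2, conf = 0.95) {
  v1 <- s1^2 / n1                       # squared standard error, group 1
  v2 <- s2^2 / n2                       # squared standard error, group 2
  se <- sqrt(v1 + v2)                   # standard error of the difference
  t_stat <- (m1 - m2) / se
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  half <- qt(1 - (1 - conf) / 2, df) * se
  list(t = t_stat, df = df,
       p = 2 * pt(-abs(t_stat), df),    # two-sided p-value
       ci = c(lower = m1 - m2 - half, upper = m1 - m2 + half))
}

# Values from Table 3: higher class versus lower class occupations
res <- welch_from_summary(940.80, 271.21, 1457, 611.62, 289.82, 867)
print(round(c(t = res$t, df = res$df), 3))
print(round(res$ci, 2))
```

Because the degrees of freedom are estimated from the two variances, they are not a whole number, which is why Table 3 reports df = 1725.1 rather than n1 + n2 - 2.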

PART C: Correlation and linear regression

1) State a research hypothesis on the relationship between P550tpr and P344pr. Give a brief explanation as to why you would expect this hypothesis.

Answer:

Research hypothesis: There is a significant positive relationship between total weekly household expenditure and the gross normal weekly household income.

In reality, a household's income is expected to be reflected in its expenditure. Judging from general life experience, those who earn more tend to spend more, while those who earn less tend to spend less. Therefore, expenditure should rise with income.

2) Report the correlation between P550tpr and P344pr. Graphically display and statistically test the relationship between these variables. Interpret your results both statistically and substantively.

Answer:

The two variables are continuous with a ratio level of measurement. Therefore, a scatterplot is an appropriate tool for visually depicting the relationship between total weekly expenditure (the dependent variable) and gross normal weekly income (the independent variable).

Figure 5: Scatterplot for total weekly expenditure and gross normal weekly income

Figure 5 above is a scatter diagram of the relationship between gross normal weekly income and total weekly expenditure. The diagram indicates a positive linear association between weekly expenditure and weekly gross normal income. In other words, an increase in gross normal weekly income is associated with an increase in weekly expenditure. Since the relationship is linear, a correlation test was conducted to investigate whether there is a significant relationship between the two variables. The Pearson r = 0.71, p-value < 0.05, which implies a strong and statistically significant positive correlation between total weekly expenditure and gross normal weekly income for households across the UK. Substantively, households with higher incomes tend to spend more each week.
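The significance test behind a Pearson correlation can be illustrated from the reported coefficient and sample size. The sketch below assumes r = 0.71 as rounded in this report and n = 5144 households; because r is rounded, the resulting t-value is approximate.

```r
r <- 0.71    # Pearson correlation as reported (rounded)
n <- 5144    # number of households

# Test statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# Share of variance in expenditure shared with income
r_squared <- r^2

print(round(c(t = t_stat, r_squared = r_squared), 3))
print(p_value < 0.05)
```

Squaring the coefficient shows that roughly half of the variation in weekly expenditure is shared with weekly income, which gives the correlation a more tangible interpretation.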

3) Estimate and present output from a simple regression model using P550tpr as the dependent variable and P344pr as the explanatory variable.

Answer:

Since the correlation test indicated a significant linear relationship between household weekly expenditure and gross income, a regression analysis is a natural next step. In this case, a simple linear regression model was estimated to determine whether gross normal weekly income has a significant effect on total weekly expenditure. The hypotheses tested are stated below.

Null hypothesis: There is no significant effect of gross normal weekly income on the total weekly expenditure.

Alternative hypothesis: There is a significant effect of gross normal weekly income on the total weekly expenditure

The R output is presented in Table 4 below. The information from the summary table is used to write the linear regression equation that describes the relationship between the two variables, as follows:

y = 122.9632 + 0.5751x

where y is the dependent variable (total weekly household expenditure) and x is the explanatory variable (gross normal weekly household income).

Table 4: Simple linear regression summary output

ANOVA summary

Source        DF      SS           MS           F       P
Regression    1       219396340    219396340    5123    <2e-16
Error         5142    220214057    42827

Coefficients

Term              Coefficient
Intercept         122.9632
(Gross Income)    0.5751
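The entries of the ANOVA summary are linked by simple arithmetic, which can be checked in base R. The sketch below uses only the sums of squares and degrees of freedom from Table 4; R-squared and the residual standard error are derived quantities that are not listed in the table itself.

```r
# Sums of squares and degrees of freedom from Table 4
ss_reg <- 219396340; df_reg <- 1
ss_err <- 220214057; df_err <- 5142

ms_reg <- ss_reg / df_reg      # mean square for regression
ms_err <- ss_err / df_err      # mean square error
f_stat <- ms_reg / ms_err      # F statistic
p_val  <- pf(f_stat, df_reg, df_err, lower.tail = FALSE)

r2       <- ss_reg / (ss_reg + ss_err)   # proportion of variance explained
resid_se <- sqrt(ms_err)                 # residual standard error

print(round(c(F = f_stat, R2 = r2, resid_se = resid_se), 3))
```

The square root of R-squared is about 0.71, which agrees with the Pearson correlation reported for the two variables, as expected in a simple regression with one predictor.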

4) Check for heteroskedasticity in your model using a plot and an appropriate post-estimation function. Interpret the results and correct you model, if necessary.

Answer:

One of the main assumptions in simple linear regression analysis is that the errors have constant variance (homoskedasticity). The fitted regression needs to meet this assumption for the standard errors, and hence the significance tests, to be reliable. Figure 6 suggests that the error variance is roughly constant, which implies that heteroskedasticity is not a problem.

Figure 6: plot for residuals against fitted values
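A residual plot is usually paired with a formal test such as the Breusch-Pagan test. The sketch below shows the studentized (Koenker) version computed by hand in base R on simulated data, because the LCF data are not reproduced here; the simulated variables `income` and `spend` are assumptions built to be heteroskedastic on purpose, so the result says nothing about the actual model.

```r
set.seed(1)
n <- 500

# Simulated data (NOT the LCF sample): error spread grows with income
income <- runif(n, min = 100, max = 1200)
spend  <- 123 + 0.575 * income + rnorm(n, sd = 0.3 * income)

sim_fit <- lm(spend ~ income)

# Auxiliary regression of squared residuals on the predictor
aux <- lm(resid(sim_fit)^2 ~ income)

# LM statistic = n * R-squared of the auxiliary regression, chi-square with 1 df
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

print(round(c(LM = lm_stat, p = p_value), 4))
```

A small p-value indicates heteroskedasticity. With the real model, the same test should be available as `bptest(fit)` in the lmtest package, and if it rejects, heteroskedasticity-robust standard errors (for example `vcovHC` from the sandwich package) are a common correction.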

5) Interpret your model statistically and substantively drawing on your hypothesis above.

Answer:

The model summary provided in Table 4 indicates an ANOVA F(1, 5142) = 5123, p-value < 0.05, which implies that the null hypothesis should be rejected. In conclusion, the results indicate that gross normal weekly income is a significant predictor of total weekly household expenditure, which supports the research hypothesis of a positive relationship. The model indicates that for every one-pound increase in gross weekly income, total weekly expenditure in the UK increases by £0.5751 on average.

6) Estimate expenditure when the value of your explanatory variable is £1,200. Indicate why it may not be appropriate to use your model to make this prediction.

Answer:

The estimated total weekly household expenditure given an income of £1,200 is £813.0832 (122.9632 + 0.5751 × 1200). However, it may not be appropriate to use this model for this prediction, since it ignores other possible predictors of expenditure, such as other sources of income.

7) Comment on the limitations of your model and whether you can infer causality.

Answer:

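The main limitation of the model is that it uses gross normal weekly income as the only predictor of weekly expenditure. Nevertheless, the model provides enough evidence that gross normal income is related to expenditure. Because the data are observational and other factors are omitted, this is best read as evidence consistent with income affecting expenditure rather than as proof of causality.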

Appendices

R-script.

LCF2013 <- read.csv("~/LCF2013.csv")

View(LCF2013)

#PART A:Descriptive statistics

#question2

install.packages("psych")

library(psych)

describe(LCF2013$P550tpr)# calculating describe statistics for the total weekly household expenditure

#pie chart for home tenure

hometenure=table(LCF2013$A121r)

lbls <- paste(names(hometenure),"\n",hometenure, sep="")

pie(hometenure, labels = lbls,main="Pie Chart of home tenure\n (with sample sizes)")

#pie chart for main source of household income

incomesource=table(LCF2013$P425r)

lbls1 <- paste(names(incomesource),"\n",incomesource, sep="")

pie(incomesource, labels = lbls1,main="Pie Chart of main source of income\n (with sample sizes)")

#bar chart for the number of children in the household

category=table(LCF2013$G019r)

barplot(category, main="children Distribution",xlab="Number of children")

#question3

# Boxplot of P344pr by Gorx

boxplot(P344pr~Gorx,data=LCF2013, main="Gross normal weekly household income",xlab="Government office region", ylab="Gross weekly income")

#####################################################################################

#Part B: Inferential Statistics: Confidence intervals, chi square and t-tests

#Question1: Confidence interval for total weekly household expenditure

smean=mean(LCF2013$P550tpr)# calculating the sample mean

smean

df=length(LCF2013$P550tpr)-1

df

Error=qt(0.975,df)*sd(LCF2013$P550tpr)/sqrt(length(LCF2013$P550tpr))# calculating the margin of error

Error

LLE=smean-Error# lower limit of 95% CI for total weekly household expenditure

LLE

ULE=smean+Error# upper limit of 95% CI for total weekly household expenditure

ULE

#question2: confidence interval for "Economically inactive" proportion

p=sum(LCF2013$A093r == "Economically inactive")/length(LCF2013$A093r)

p #point estimate

SE=sqrt(p*(1-p)/length(LCF2013$A093r))

SE # standard error

ME=qnorm(.995)*SE #margin of error

ME

pCI=c((p-ME),(p+ME))

pCI

#question 3: Crosstabulating between G018r and A121r for those living in the North West and Merseyside region

newdata=subset(LCF2013,Gorx=="North West and Merseyside", select=c(casenew,G018r,A121r))#subsetting the original dataset

crosstab=table(newdata$G018r,newdata$A121r)#cross tabulation of the data

crosstab

chisq.test(crosstab)# test of independence

#question 4: Testing the strength of the association

library(pwr)

effect=ES.w2(crosstab/585)# Cohen's w effect size from the table of cell probabilities (585 = households in the subset)

effect

#question5: comparing weekly household income

income1=subset(LCF2013,A093r=="Full-time working"|A093r=="Part-time working",select =c(casenew,A094r,P344pr)) #subsetting using economic position of the reference person

income2=subset(income1,A094r=="Higher managerial, administrative and professional occupations"|A094r=="Routine and manual occupations",select =c(casenew,A094r,P344pr))#stage 2 subsetting to only remain with only higher and lower class occupations

describeBy(income2$P344pr,income2$A094r)# summary statistics for income grouped by occupation (describe.by is deprecated in psych)

shapiro.test(income2$P344pr) # test for normality of the dependent variable

t.test(income2$P344pr~income2$A094r,conf.level=0.95,alternative="two.sided") # Welch two-sample t-test (var.equal=FALSE by default)

######################################################################################################################

#PART C: Correlation and Regression analysis

#Question 2

regdata = data.frame(LCF2013$P550tpr,LCF2013$P344pr)# creating a dataframe for correlation analysis

corr.test(regdata,y=NULL, use="pairwise", alpha=0.05, method="pearson", ci=TRUE)# correlation analysis

plot(LCF2013$P344pr,LCF2013$P550tpr, main="Scatterplot for Income and expenditure",xlab="Gross normal weekly household income", ylab = "Total weekly household expenditure",col="red",pch=19)

# question 3

fit=lm(LCF2013$P550tpr~LCF2013$P344pr)#simple regression

fit # regression equation/coefficients

aov(fit)

summary.aov(fit)# summary of regression ANOVA

#question 4

plot(fit$fitted.values,fit$residuals,xlab="Fitted values",ylab="Residuals") # residuals against fitted values to check for heteroscedasticity

# question 6

y=122.9632+0.5751*(1200)

y # predicted value