Subject: BAN - Business Analytics Date: September 19th, 2018 Teacher: Marta Fernández Diego Student: Fábio Santos Lobão
Table Header | Second Header |
---|---|
Table Cell | Cell 2 |
Cell 3 | Cell 4 |
In R, use the dataset wine.csv to create a linear regression model to predict Price using HarvestRain and WinterRain as independent variables. Using the summary output of this model, answer the following questions:
DFWine <- read.csv("./wine.csv")
cat('STR\n')
## STR
str(DFWine)
## 'data.frame': 25 obs. of 7 variables:
## $ Year : int 1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
## $ Price : num 7.5 8.04 7.69 6.98 6.78 ...
## $ WinterRain : int 600 690 502 420 582 485 763 830 697 608 ...
## $ AGST : num 17.1 16.7 17.1 16.1 16.4 ...
## $ HarvestRain: int 160 80 130 110 187 187 290 38 52 155 ...
## $ Age : int 31 30 28 26 25 24 23 22 21 20 ...
## $ FrancePop : num 43184 43495 44218 45152 45654 ...
cat('\nSummary\n')
##
## Summary
summary(DFWine)
## Year Price WinterRain AGST
## Min. :1952 Min. :6.205 Min. :376.0 Min. :14.98
## 1st Qu.:1960 1st Qu.:6.519 1st Qu.:536.0 1st Qu.:16.20
## Median :1966 Median :7.121 Median :600.0 Median :16.53
## Mean :1966 Mean :7.067 Mean :605.3 Mean :16.51
## 3rd Qu.:1972 3rd Qu.:7.495 3rd Qu.:697.0 3rd Qu.:17.07
## Max. :1978 Max. :8.494 Max. :830.0 Max. :17.65
## HarvestRain Age FrancePop
## Min. : 38.0 Min. : 5.0 Min. :43184
## 1st Qu.: 89.0 1st Qu.:11.0 1st Qu.:46584
## Median :130.0 Median :17.0 Median :50255
## Mean :148.6 Mean :17.2 Mean :49694
## 3rd Qu.:187.0 3rd Qu.:23.0 3rd Qu.:52894
## Max. :292.0 Max. :31.0 Max. :54602
What is the “Multiple R-squared” value of your model?
Model = lm(Price ~HarvestRain + WinterRain, data=DFWine)
summary(Model)
##
## Call:
## lm(formula = Price ~ HarvestRain + WinterRain, data = DFWine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0933 -0.3222 -0.1012 0.3871 1.1877
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.865e+00 6.616e-01 11.888 4.76e-11 ***
## HarvestRain -4.971e-03 1.601e-03 -3.105 0.00516 **
## WinterRain -9.848e-05 9.007e-04 -0.109 0.91392
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5611 on 22 degrees of freedom
## Multiple R-squared: 0.3177, Adjusted R-squared: 0.2557
## F-statistic: 5.122 on 2 and 22 DF, p-value: 0.01492
cat('Model using harvest and winter rain as IV:\n
Multiple R.square:',summary(Model)$r.square,'\n')
## Model using harvest and winter rain as IV:
##
## Multiple R.square: 0.3177024
What is the coefficient for HarvestRain?
cat('Model using harvest and winter rain as IV:\n
Coefficiente for \"HarvestRain\":',summary(Model)$coefficients[2],'\n')
## Model using harvest and winter rain as IV:
##
## Coefficiente for "HarvestRain": -0.004971004
What is the intercept coefficient?
cat('Model using harvest and winter rain as IV:\n
Intercept coefficient:',summary(Model)$coefficients[1],'\n')
## Model using harvest and winter rain as IV:
##
## Intercept coefficient: 7.865325
Use the dataset wine.csv to create a linear regression model to predict Price using HarvestRain and WinterRain as independent variables, like you did in the previous quick question. Using the summary output of this model, answer the following questions:
Is the coefficient for HarvestRain significant?
cat('Model using harvest and winter rain as IV:\n
\"HarestRain\" significance is',summary(Model)$coefficients[,4]['HarvestRain'],'. Since it`s lower than the accepted 0,05 upper limit, it can be considered significant\n')
## Model using harvest and winter rain as IV:
##
## "HarestRain" significance is 0.005162706 . Since it`s lower than the accepted 0,05 upper limit, it can be considered significant
Is the coefficient for WinterRain significant?
cat('Model using harvest and winter rain as IV:\n
\"HarestRain\" significance is',summary(Model)$coefficients[,4]['WinterRain'],'. Since it`s higher than the accepted 0,05 upper limit, it can be considered non significant\n')
## Model using harvest and winter rain as IV:
##
## "HarestRain" significance is 0.913921 . Since it`s higher than the accepted 0,05 upper limit, it can be considered non significant
Using the data set wine.csv, what is the correlation between HarvestRain and WinterRain?
cat('Correlation between HarvestRain and WinterRain is',cor(DFWine$HarvestRain, DFWine$WinterRain))
## Correlation between HarvestRain and WinterRain is -0.2754409
Which of the following are NOT valid values for an out-of-sample (test set) R² ? Select all that apply. (-7.0, -0.3, 0.0, 0.6, 1.0, 2.4)
Answer: R square can not be higher than 1 since the ratio os squared errors is always a positive number. It may be negative if the model perform much worst than the baseline with the test set than the original training set.
# Computing R square for the model
DFWineTest <- read.csv("./wine_test.csv")
str(DFWineTest)
## 'data.frame': 2 obs. of 7 variables:
## $ Year : int 1979 1980
## $ Price : num 6.95 6.5
## $ WinterRain : int 717 578
## $ AGST : num 16.2 16
## $ HarvestRain: int 122 74
## $ Age : int 4 3
## $ FrancePop : num 54836 55110
Model = lm(Price ~AGST + HarvestRain + WinterRain + Age, data = DFWine)
PredictTest = predict(Model,newdata = DFWineTest)
PredictTest
## 1 2
## 6.768925 6.684910
SSE = sum((DFWineTest$Price-PredictTest)^2)
SST = sum((DFWineTest$Price-mean(DFWine$Price))^2)
r.square = 1 - (SSE/SST)
Model = lm(Price ~HarvestRain + WinterRain, data=DFWine)
SSE = sum(Model$residuals^2)
cat('Model using harvest and winter rain as IV:\n
R.square:',summary(Model)$r.square,
'\n SSE: ',SSE,
'\n SSE: ',SSE,'\n\n')
## Model using harvest and winter rain as IV:
##
## R.square: 0.3177024
## SSE: 6.925756
## SSE: 6.925756
summary(Model)
##
## Call:
## lm(formula = Price ~ HarvestRain + WinterRain, data = DFWine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0933 -0.3222 -0.1012 0.3871 1.1877
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.865e+00 6.616e-01 11.888 4.76e-11 ***
## HarvestRain -4.971e-03 1.601e-03 -3.105 0.00516 **
## WinterRain -9.848e-05 9.007e-04 -0.109 0.91392
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5611 on 22 degrees of freedom
## Multiple R-squared: 0.3177, Adjusted R-squared: 0.2557
## F-statistic: 5.122 on 2 and 22 DF, p-value: 0.01492
summary(Model)$coefficients[3]
## [1] -9.848197e-05
Model =lm(Price ~ AGST, data = DFWine)
SSE = sum(Model$residuals^2)
cat('Model using only AGST as IV:\n R.square:',summary(Model)$r.square,'\n SSE: ',SSE,'\n\n')
## Model using only AGST as IV:
## R.square: 0.4350232
## SSE: 5.734875
length(Model$residuals)
## [1] 25
Model = lm(Price ~AGST + HarvestRain + WinterRain + Age + FrancePop, data = DFWine)
SSE = sum(Model$residuals^2)
cat('Model using AGST, Harvest Rain, WinterRain, Age and France Population as IV:\n R.square:',summary(Model)$r.square,'\n SSE: ',SSE,'\n')
## Model using AGST, Harvest Rain, WinterRain, Age and France Population as IV:
## R.square: 0.8293592
## SSE: 1.732113
cor(DFWine$WinterRain, DFWine$Price)
## [1] 0.1366505
cor(DFWine$Age, DFWine$FrancePop)
## [1] -0.9944851
correlation = cor(DFWine)
correlation[abs(correlation[])>0.1]
## [1] 1.0000000 -0.4477679 -0.2469159 -1.0000000 0.9944851 -0.4477679
## [7] 1.0000000 0.1366505 0.6595629 -0.5633219 0.4477679 -0.4668616
## [13] 0.1366505 1.0000000 -0.3210906 -0.2754409 -0.2469159 0.6595629
## [19] -0.3210906 1.0000000 0.2469159 -0.2591623 -0.5633219 -0.2754409
## [25] 1.0000000 -1.0000000 0.4477679 0.2469159 1.0000000 -0.9944851
## [31] 0.9944851 -0.4668616 -0.2591623 -0.9944851 1.0000000
# remove all variables
rm(list=ls())