Lab 3 - THE STATISTICAL SOMMELIER

Subject: BAN - Business Analytics Date: September 19th, 2018 Teacher: Marta Fernández Diego Student: Fábio Santos Lobão

Table Header Second Header
Table Cell Cell 2
Cell 3 Cell 4

Introduction

Video 4 - The Regression Model

In R, use the dataset wine.csv to create a linear regression model to predict Price using HarvestRain and WinterRain as independent variables. Using the summary output of this model, answer the following questions:

DFWine <- read.csv("./wine.csv")
cat('STR\n')
## STR
str(DFWine)
## 'data.frame':    25 obs. of  7 variables:
##  $ Year       : int  1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
##  $ Price      : num  7.5 8.04 7.69 6.98 6.78 ...
##  $ WinterRain : int  600 690 502 420 582 485 763 830 697 608 ...
##  $ AGST       : num  17.1 16.7 17.1 16.1 16.4 ...
##  $ HarvestRain: int  160 80 130 110 187 187 290 38 52 155 ...
##  $ Age        : int  31 30 28 26 25 24 23 22 21 20 ...
##  $ FrancePop  : num  43184 43495 44218 45152 45654 ...
cat('\nSummary\n')
## 
## Summary
summary(DFWine)
##       Year          Price         WinterRain         AGST      
##  Min.   :1952   Min.   :6.205   Min.   :376.0   Min.   :14.98  
##  1st Qu.:1960   1st Qu.:6.519   1st Qu.:536.0   1st Qu.:16.20  
##  Median :1966   Median :7.121   Median :600.0   Median :16.53  
##  Mean   :1966   Mean   :7.067   Mean   :605.3   Mean   :16.51  
##  3rd Qu.:1972   3rd Qu.:7.495   3rd Qu.:697.0   3rd Qu.:17.07  
##  Max.   :1978   Max.   :8.494   Max.   :830.0   Max.   :17.65  
##   HarvestRain         Age         FrancePop    
##  Min.   : 38.0   Min.   : 5.0   Min.   :43184  
##  1st Qu.: 89.0   1st Qu.:11.0   1st Qu.:46584  
##  Median :130.0   Median :17.0   Median :50255  
##  Mean   :148.6   Mean   :17.2   Mean   :49694  
##  3rd Qu.:187.0   3rd Qu.:23.0   3rd Qu.:52894  
##  Max.   :292.0   Max.   :31.0   Max.   :54602

1.1 - Multiple R-squared

What is the “Multiple R-squared” value of your model?

Model = lm(Price ~HarvestRain + WinterRain, data=DFWine)

summary(Model)
## 
## Call:
## lm(formula = Price ~ HarvestRain + WinterRain, data = DFWine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0933 -0.3222 -0.1012  0.3871  1.1877 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.865e+00  6.616e-01  11.888 4.76e-11 ***
## HarvestRain -4.971e-03  1.601e-03  -3.105  0.00516 ** 
## WinterRain  -9.848e-05  9.007e-04  -0.109  0.91392    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5611 on 22 degrees of freedom
## Multiple R-squared:  0.3177, Adjusted R-squared:  0.2557 
## F-statistic: 5.122 on 2 and 22 DF,  p-value: 0.01492
cat('Model using harvest and winter rain as IV:\n
         Multiple R.square:',summary(Model)$r.square,'\n')
## Model using harvest and winter rain as IV:
## 
##          Multiple R.square: 0.3177024

1.2 - HarvestRain coefficient

What is the coefficient for HarvestRain?

cat('Model using harvest and winter rain as IV:\n
         Coefficiente for \"HarvestRain\":',summary(Model)$coefficients[2],'\n')
## Model using harvest and winter rain as IV:
## 
##          Coefficiente for "HarvestRain": -0.004971004

1.3 - intercept coefficient

What is the intercept coefficient?

cat('Model using harvest and winter rain as IV:\n
         Intercept coefficient:',summary(Model)$coefficients[1],'\n')
## Model using harvest and winter rain as IV:
## 
##          Intercept coefficient: 7.865325

Video 5 -

Use the dataset wine.csv to create a linear regression model to predict Price using HarvestRain and WinterRain as independent variables, like you did in the previous quick question. Using the summary output of this model, answer the following questions:

2.1 - HarvestRain significant

Is the coefficient for HarvestRain significant?

cat('Model using harvest and winter rain as IV:\n
         \"HarestRain\" significance is',summary(Model)$coefficients[,4]['HarvestRain'],'. Since it`s lower than the accepted 0,05 upper limit, it can be considered significant\n')
## Model using harvest and winter rain as IV:
## 
##          "HarestRain" significance is 0.005162706 . Since it`s lower than the accepted 0,05 upper limit, it can be considered significant

2.2 - WinRain Significance

Is the coefficient for WinterRain significant?

cat('Model using harvest and winter rain as IV:\n
         \"HarestRain\" significance is',summary(Model)$coefficients[,4]['WinterRain'],'. Since it`s higher than the accepted 0,05 upper limit, it can be considered non significant\n')
## Model using harvest and winter rain as IV:
## 
##          "HarestRain" significance is 0.913921 . Since it`s higher than the accepted 0,05 upper limit, it can be considered non significant

Video 6 - Checking Correlation

3.1 - Correlation between HarvestRain and WinterRain

Using the data set wine.csv, what is the correlation between HarvestRain and WinterRain?

cat('Correlation between HarvestRain and WinterRain is',cor(DFWine$HarvestRain, DFWine$WinterRain))
## Correlation between HarvestRain and WinterRain is -0.2754409

Video 7 -

2.1 Testing Model

Which of the following are NOT valid values for an out-of-sample (test set) R² ? Select all that apply. (-7.0, -0.3, 0.0, 0.6, 1.0, 2.4)

Answer: R square can not be higher than 1 since the ratio os squared errors is always a positive number. It may be negative if the model perform much worst than the baseline with the test set than the original training set.

# Computing R square for the model

DFWineTest <- read.csv("./wine_test.csv")

str(DFWineTest)
## 'data.frame':    2 obs. of  7 variables:
##  $ Year       : int  1979 1980
##  $ Price      : num  6.95 6.5
##  $ WinterRain : int  717 578
##  $ AGST       : num  16.2 16
##  $ HarvestRain: int  122 74
##  $ Age        : int  4 3
##  $ FrancePop  : num  54836 55110
Model = lm(Price ~AGST + HarvestRain + WinterRain + Age, data = DFWine)

PredictTest = predict(Model,newdata = DFWineTest) 

PredictTest
##        1        2 
## 6.768925 6.684910
SSE = sum((DFWineTest$Price-PredictTest)^2)
SST = sum((DFWineTest$Price-mean(DFWine$Price))^2)

r.square = 1 - (SSE/SST)

Conclusion

Model = lm(Price ~HarvestRain + WinterRain, data=DFWine)
SSE = sum(Model$residuals^2)
cat('Model using harvest and winter rain as IV:\n
         R.square:',summary(Model)$r.square,
'\n         SSE:     ',SSE,
'\n         SSE:     ',SSE,'\n\n')
## Model using harvest and winter rain as IV:
## 
##          R.square: 0.3177024 
##          SSE:      6.925756 
##          SSE:      6.925756
summary(Model)
## 
## Call:
## lm(formula = Price ~ HarvestRain + WinterRain, data = DFWine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0933 -0.3222 -0.1012  0.3871  1.1877 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.865e+00  6.616e-01  11.888 4.76e-11 ***
## HarvestRain -4.971e-03  1.601e-03  -3.105  0.00516 ** 
## WinterRain  -9.848e-05  9.007e-04  -0.109  0.91392    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5611 on 22 degrees of freedom
## Multiple R-squared:  0.3177, Adjusted R-squared:  0.2557 
## F-statistic: 5.122 on 2 and 22 DF,  p-value: 0.01492
summary(Model)$coefficients[3]
## [1] -9.848197e-05
Model =lm(Price ~ AGST, data = DFWine)
SSE = sum(Model$residuals^2)
cat('Model using only AGST as IV:\n    R.square:',summary(Model)$r.square,'\n    SSE:     ',SSE,'\n\n')
## Model using only AGST as IV:
##     R.square: 0.4350232 
##     SSE:      5.734875
length(Model$residuals)
## [1] 25
Model = lm(Price ~AGST + HarvestRain + WinterRain + Age + FrancePop, data = DFWine)
SSE = sum(Model$residuals^2)
cat('Model using AGST, Harvest Rain, WinterRain, Age and France Population as IV:\n    R.square:',summary(Model)$r.square,'\n    SSE:     ',SSE,'\n')
## Model using AGST, Harvest Rain, WinterRain, Age and France Population as IV:
##     R.square: 0.8293592 
##     SSE:      1.732113
cor(DFWine$WinterRain, DFWine$Price)
## [1] 0.1366505
cor(DFWine$Age, DFWine$FrancePop)
## [1] -0.9944851
correlation = cor(DFWine)


correlation[abs(correlation[])>0.1]
##  [1]  1.0000000 -0.4477679 -0.2469159 -1.0000000  0.9944851 -0.4477679
##  [7]  1.0000000  0.1366505  0.6595629 -0.5633219  0.4477679 -0.4668616
## [13]  0.1366505  1.0000000 -0.3210906 -0.2754409 -0.2469159  0.6595629
## [19] -0.3210906  1.0000000  0.2469159 -0.2591623 -0.5633219 -0.2754409
## [25]  1.0000000 -1.0000000  0.4477679  0.2469159  1.0000000 -0.9944851
## [31]  0.9944851 -0.4668616 -0.2591623 -0.9944851  1.0000000
# remove all variables
rm(list=ls())