Project - Elantra Car Sales

Abstract

In this project we applied the knowledge acquired during the present course to build linear models that predict the monthly sales of the Hyundai Elantra. The exercise also allowed us to explore other features of the R language, including libraries such as ggplot2, corrplot, dplyr and kableExtra, used to manipulate, analyse and better present the results.

The dataset supplied includes 7 variables and 50 observations. Due to the small size of the dataset, several assumptions such as normality cannot be rigorously evaluated with numerical methods. Using histograms and graphical representations we were able to ascertain that the data is not normally distributed.

After data transformations to enable the proper use of the month and year information, originally stored in two categorical variables, it was possible to identify specific months of the year with more sales than others, most likely due to events such as the release of new models. This periodicity allowed us to create a new categorical variable describing what we called "sales seasons". This variable assumes three values, "High", "Mid" and "Low", which groups months together and thus increases the number of observations in each group.

Several experiments were performed using different linear models; seven were examined in more detail and are presented in this report. These models use the existing variables and the rationale of the car market, exploring different ways of combining the variables with the time reference described by "month" and "season". In some cases, stepwise optimization was used to reduce the number of variables.

These seven models were evaluated using the last 12 months as a test dataset and the first 38 months as a training dataset. The comparison used boxplots of the errors, the R-squared and the root mean square error (RMSE) obtained over the test dataset.

From this first test, two models were selected for further evaluation. This second evaluation applied a cross-validation method to improve the reliability of the test: the dataset was split into ten blocks of 5 consecutive months (one observation per month), and in each trial one block was used for testing and the remaining nine for training. This allowed us to perform 10 trials on each of the pre-selected models.

The model that uses the season variable presented the smallest error and was considered the best model. It predicts that sales are negatively related to the unemployment rate and are affected by sales seasons created by externalities, such as the release of a new car design. The model is simple to apply and understand.

In the tests performed, the proposed model produced errors that varied from about -7000 to about +4300 units, with an RMSE of about 4800 units. Considering all evaluations, percentage errors ranged from about -84% to +28%, with the first and third quartiles, representing the most probable scenarios, ranging from about -12% to +11%. We consider these errors remarkably high; further studies are needed to incorporate other variables, less correlated with the existing ones, or to apply other estimation methods more suitable to this example.

Introduction

As an opening remark, all text presented like this one, in a larger font size and highlighted by a grey bar on the left side, is original text written for this essay. Other text, presented in smaller letters such as what follows in this introduction, consists of quotations from the original exercise proposition and was kept here as reference.

An important application of linear regression is understanding sales. Consider a company that produces and sells a product. In a given period, if the company produces more units than how many consumers will buy, the company will not earn money on the unsold units and will incur additional costs due to having to store those units in inventory before they can be sold. If it produces fewer units than how many consumers will buy, the company will earn less than it potentially could have earned. Being able to predict consumer sales, therefore, is of first order importance to the company.

In this problem, we will try to predict monthly sales of the Hyundai Elantra in the United States. The Hyundai Motor Company is a major automobile manufacturer based in South Korea. The Elantra is a car model that has been produced by Hyundai since 1990 and is sold all over the world, including the United States. We will build a linear regression model to predict monthly sales using economic indicators of the United States as well as Google search queries.

The file elantra.csv contains data for the problem. Each observation is a month, from January 2010 to February 2014. For each month, we have the following variables:

Step 1. Problem definition

The first step is to of course understand the business problem. What is the problem you are trying to solve - what is the business context? Very often however your client may also just give you a whole lot of data and ask you to do something with it. In such a case you would need to take a more exploratory look at the data. Nevertheless, if the client has a specific problem that needs to be tackled, then the first step is to clearly define and understand the problem. You will then need to convert the business problem into an analytics problem. In other words, you need to understand exactly what you are going to predict with the model you build. There is no point in building a fabulous model, only to realise later that what it is predicting is not exactly what the business needs.

In short, we can state the business problem as: “the client wants to predict the future amount of sales (ElantraSales) based on the past values and a given dataset”.

An important aspect is that car sales are strongly affected by new releases and other externalities that usually have a periodicity of one year; one should therefore expect to see year-long cycles in sales, and this time reference is a key factor in the analysis.

Step 2. Data Exploration

Once you have the problem defined, the next step is to explore the data and become more familiar with it. This is especially important when dealing with a completely new data set.

First, let's load the libraries to be used and set up the environment variables.

# libraries to plot with GGPlot
library(ggplot2)
library(reshape2)
library(scales)
library(ggpubr)

# libraries to plot the correlation map (visually more interesting than GGPlot)
library(corrplot)

# libraries used to normalize data
library(dplyr)

# libraries used for pretty-printing tables
library(DT)
library(data.table)

library(kableExtra)
library(lemon)

knit_print.data.frame <- lemon_print
knit_print.table <- lemon_print
knit_print.grouped_df <- lemon_print # enables dplyr results
knit_print.tibble <- lemon_print
knit_print.tbl <- lemon_print

Then let's read the data and get a quick view of the variables, as presented in Tables 1 and 2.

DFElantra <- read.csv("./elantra.csv")

FootNote = sprintf("Number of observations: %d",nrow(DFElantra)) 

kable(data.frame(Variable = names(DFElantra),
              Classe = sapply(DFElantra, typeof),
              First_Values = sapply(DFElantra, function(x) paste0(head(x),  collapse = ", ")),
              row.names = NULL),
      format = "html",
      caption = "Table 1: STR for the complete Elantra Dataset",
      table.attr = "style = \"color: black; font-size: 11pt;\"") %>%
      kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
      footnote(general = FootNote, general_title = "")

kable(data.frame(unclass(summary(DFElantra)),
                check.names = FALSE,
                stringsAsFactors = FALSE,
                row.names = NULL),
      format = "html",
      caption = "Table 2: Summary of complete Elantra Dataset", 
      table.attr = "style = \"color: black; font-size: 11pt;\"") %>%
      kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) 
Table 1: STR for the complete Elantra Dataset
Variable Classe First_Values
Month integer 1, 1, 1, 1, 1, 2
Year integer 2010, 2011, 2012, 2013, 2014, 2010
ElantraSales integer 7690, 9659, 10900, 12174, 15326, 7966
Unemployment double 9.7, 9.1, 8.2, 7.9, 6.6, 9.8
Queries integer 153, 259, 354, 230, 232, 130
CPI_energy double 213.377, 229.353, 244.178, 242.56, 247.575, 209.924
CPI_all double 217.466, 221.082, 227.666, 231.321, 234.933, 217.251
Number of observations: 50
Table 2: Summary of complete Elantra Dataset
Month Year ElantraSales Unemployment Queries CPI_energy CPI_all
Min. : 1.0 Min. :2010 Min. : 7690 Min. :6.600 Min. :130.0 Min. :204.2 Min. :217.3
1st Qu.: 3.0 1st Qu.:2011 1st Qu.:12560 1st Qu.:7.725 1st Qu.:224.8 1st Qu.:230.1 1st Qu.:221.3
Median : 6.0 Median :2012 Median :15624 Median :8.250 Median :262.5 Median :244.4 Median :227.9
Mean : 6.3 Mean :2012 Mean :16005 Mean :8.422 Mean :263.5 Mean :236.9 Mean :226.7
3rd Qu.: 9.0 3rd Qu.:2013 3rd Qu.:19197 3rd Qu.:9.100 3rd Qu.:311.0 3rd Qu.:247.1 3rd Qu.:231.7
Max. :12.0 Max. :2014 Max. :26153 Max. :9.900 Max. :427.0 Max. :256.4 Max. :235.2

Looking at the data, something that promptly calls one's attention is that the dataset is very small: if we consider the twelve months separately, in order to capture the yearly cycles of the car industry, we get fewer than 5 samples for each month, and even fewer if we take one year out of the training dataset for testing purposes.

Another important feature that one can notice from the data summary is that there are no missing values. For most of the variables, the mean and median values are fairly close together, which suggests that the data is evenly spread between its minimum and maximum, with no distortions such as strong skewness.
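As a quick numerical cross-check of this reading, a hand-rolled skewness estimate (a sketch added for this essay, not part of the original exercise) can be computed for each numeric variable; values close to zero support the absence of strong skew.

# Quick cross-check (illustration only): hand-rolled skewness estimate per numeric variable.
# Values near 0 support the reading that the data shows no strong skew.
skew = function(x) mean((x - mean(x))^3) / sd(x)^3
sapply(DFElantra[c("ElantraSales", "Unemployment", "Queries", "CPI_energy", "CPI_all")], skew)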

The sales information is stored in the numeric variable "ElantraSales", which shall be our dependent variable. The remaining variables will be our independent variables. They are all also numeric, except "Month" and "Year", which may be considered categorical in the format presented. With this basic understanding of how the data is organized, we may initially propose the use of a linear model and proceed with the data preparation in that direction.

Step 3. Data Preparation

Now that you have a good understanding of the data, you will need to prepare it for modelling. You will identify and treat missing values, detect outliers, transform variables, create binary variables if required and so on. This stage is very influenced by the modelling technique you will use at the next stage. For example, regression involves a fair amount of data preparation, but decision trees may need less preparation whereas clustering requires a whole different kind of preparation as compared to other techniques.

Considering the small size of the dataset, normality tests and other numerical techniques might be misleading, so we proceed to explore the data visually through a series of graphs.
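For reference only, a formal test such as Shapiro-Wilk could be run as in the sketch below; with just 50 observations its p-values should be read with caution, which is why we rely on the visual inspection that follows.

# For reference only: Shapiro-Wilk normality test on each numeric variable.
# With only 50 observations the p-values should be read with caution.
sapply(DFElantra[c("ElantraSales", "Unemployment", "Queries", "CPI_energy", "CPI_all")],
       function(x) shapiro.test(x)$p.value)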

First, let's look at the histograms (Figures 1 to 6). The Year variable was not evaluated since we are dealing with a continuous period and its histogram would look much like the one presented for the Month variable.

# set up a clean theme
ggplot_theme = theme_set(theme_classic()) 

# plot histograms to evaluate normality
histo01 = ggplot(DFElantra, aes(ElantraSales)) +
  geom_histogram(show.legend = FALSE, colour = "white", fill = "gray", bins = 10) +
  labs(x = "Elantra Sales", y = "Count") +
  ggtitle("Figure 1: Histogram for the ElantraSales Variable")
histo02 = ggplot(DFElantra, aes(Unemployment)) +
  geom_histogram(show.legend = FALSE, colour = "white", fill = "gray", bins = 10) +
  labs(x = "Unemployment", y = "Count") +
  ggtitle("Figure 2: Histogram for the Unemployment variable")
histo03 = ggplot(DFElantra, aes(Queries)) +
  geom_histogram(show.legend = FALSE, colour = "white", fill = "gray", bins = 10) +
  labs(x = "Queries", y = "Count") +
  ggtitle("Figure 3: Histogram for the Queries variable")
histo04 = ggplot(DFElantra, aes(CPI_energy)) +
  geom_histogram(show.legend = FALSE, colour = "white", fill = "gray", bins = 10) +
  labs(x = "CPI_energy", y = "Count") +
  ggtitle("Figure 4: Histogram for the CPI_energy variable")
histo05 = ggplot(DFElantra, aes(CPI_all)) +
  geom_histogram(show.legend = FALSE, colour = "white", fill = "gray", bins = 10) +
  labs(x = "CPI_all", y = "Count") +
  ggtitle("Figure 5: Histogram for the CPI_all variable")
histo06 = ggplot(DFElantra, aes(Month)) +
  geom_histogram(show.legend = FALSE, colour = "white", fill = "gray", bins = 12) +
  labs(x = "Month", y = "Count") +
  ggtitle("Figure 6: Histogram for the Month variable")

histo01; histo02; histo03; histo04; histo05; histo06

From the histograms, one can see that the variables are not normally distributed, which means that measures such as R-squared might be misleading.

Let's proceed by searching for patterns on the time scale, as would be expected given the nature of the car market.

To better understand the whole dataset over the timespan, let's create a new variable in date format that will allow us to reorder and split the dataset based on the date information.

To perform this analysis, we also need to normalize the data so that several curves can be plotted on the same scale.

The results are presented in the following figures.

# Since the Date format needs the day information to work, let's use the first day of the month as reference.
DFElantra$RefDate = as.Date(strptime(sprintf('1/%d/%d',DFElantra$Month,DFElantra$Year), "%d/%m/%Y"))

# Normalize data to evaluate temporal behavior
DFNormalized = DFElantra %>%
  mutate(across(c(ElantraSales, Unemployment, Queries, CPI_energy, CPI_all),
                ~ as.vector(scale(.))))
# Create Dataframes that will allow us to use ggplot to draw multiple variables into a single plot, using the time scale as reference.
DFMUnemployment = melt(data = DFNormalized, id.vars = "RefDate", measure.vars = c("ElantraSales", "Unemployment"))
DFMQueries = melt(data = DFNormalized, id.vars = "RefDate", measure.vars = c("ElantraSales", "Queries"))
DFMEnergy = melt(data = DFNormalized, id.vars = "RefDate", measure.vars = c("ElantraSales", "CPI_energy"))
DFMAll = melt(data = DFNormalized, id.vars = "RefDate", measure.vars = c("ElantraSales", "CPI_all"))

plot01 = ggplot(DFMUnemployment, aes(x=RefDate, y=value, colour = variable, na.rm=TRUE)) + 
  geom_line() + 
  geom_point() +
  stat_summary(fun.y = "median", geom = "point", size = 3) +
  geom_smooth(method='lm')+
  ggtitle("Figure 7: Relation of Unemployment and ElantraSales with time")

plot02 = ggplot(DFMQueries, aes(x=RefDate, y=value, colour = variable)) + 
  geom_line() + 
  geom_point() +
  stat_summary(fun.y = "median", geom = "point", size = 3) +
  geom_smooth(method='lm') +
  ggtitle("Figure 8: Relation of Queries and ElantraSales with time")

plot03 = ggplot(DFMEnergy, aes(x=RefDate, y=value, colour = variable)) + 
  geom_line() + 
  geom_point() +
  stat_summary(fun.y = "median", geom = "point", size = 3) +
  geom_smooth(method='lm') +
  ggtitle("Figure 9: Relation of CPI_Energy and ElantraSales with time")

plot04 = ggplot(DFMAll, aes(x=RefDate, y=value, colour = variable)) + 
  geom_line() + 
  geom_point() +
  stat_summary(fun.y = "median", geom = "point", size = 3) +
  geom_smooth(method='lm') +
  ggtitle("Figure 10: Relation of CPI_All and ElantraSales with time")

# Remove unnecessary variables
rm(DFNormalized, DFMAll,DFMEnergy,DFMQueries,DFMUnemployment)

plot01; plot02; plot03; plot04

Apart from the clear correlation of the variables with the Elantra sales, looking along the time axis one can discern a cyclic pattern with a periodicity of one year, as expected.

To further evaluate the periodicity, let's consider the Elantra sales variable over the whole dataset, aggregated into months, bimonths and quarters (Figures 11 to 13).

# Create quarter and bimonth maps
DFQuarterMap = data.frame(Month=1:12,Quarter=rep(1:4,each=3))
DFBimonthMap = data.frame(Month=1:12,Bimonth=rep(1:6,each=2))

# Merge maps with the dataset to create new columns
DFElantraExp = merge(DFElantra,DFQuarterMap)
DFElantraExp = merge(DFElantraExp,DFBimonthMap)

# Plot curves
plot05 = ggplot(DFElantraExp, aes(x=Month, y=ElantraSales)) + 
  geom_point(alpha = 0.3) +
  stat_summary(fun.data=mean_cl_normal) + 
  geom_smooth(method='lm') +
  scale_x_continuous(name="Months", limits=c(0, 13)) +
  ggtitle("Figure 11: ElantraSales over months")

plot06 = ggplot(DFElantraExp, aes(x=Quarter, y=ElantraSales)) + 
  geom_point(alpha = 0.3) +
  stat_summary(fun.data=mean_cl_normal) + 
  geom_smooth(method='lm') +
  ggtitle("Figure 12: ElantraSales over quarters")

plot07 = ggplot(DFElantraExp, aes(x=Bimonth, y=ElantraSales)) + 
  geom_point(alpha = 0.3) +
  stat_summary(fun.data=mean_cl_normal) + 
  geom_smooth(method='lm') +
  ggtitle("Figure 13: ElantraSales over bimonths")

# Remove unnecessary variables
rm(DFQuarterMap,DFBimonthMap,DFElantraExp)

plot05; plot06; plot07

Looking especially at the monthly graph, one can see "seasons": one with more sales (March to August), one with fewer sales (January, February, October and November) and a season in between (September and December).

Since these seasons are not aligned with the bimonth or quarter definitions, let's then create a categorical variable classifying them as "High", "Mid" and "Low" sales seasons.

# Create sales season maps
DFSeasonMap = data.frame(Month=1:12,Season=factor(c('Low','Low','High','High','High','High','High','High','Mid','Low','Low','Mid')))
# Merge season map with the dataset to create new columns
DFElantra = merge(DFElantra,DFSeasonMap)

DFElantra$Month <- as.factor(DFElantra$Month)
# Drop the Year variable, which is not needed for any further computation
# (Month is kept and converted to a factor above).
drops = c("Year")
DFElantra = DFElantra[ , !(names(DFElantra) %in% drops)]

# Remove unnecessary variables
rm(DFSeasonMap, drops)

We can check whether our season definition is reasonable by building a boxplot for each season, as presented in Figure 14.

plot08 = ggplot(DFElantra, aes(x=Season, y=ElantraSales)) +
  geom_boxplot() +
  geom_point() +
  guides(fill = FALSE) +
  stat_summary(fun.y = mean, geom = "point", shape = 5, size = 4) +
  ggtitle("Figure 13: ElantraSales over seasons")

plot08

From the boxplot we can see that the points are evenly spread within each season, with a significant difference between the median values. The diamond-shaped point represents the mean, which in general is quite close to the median, even though the distributions are not normal.

From this plot we see no reason to drop any observation as an outlier. Although some of the values are quite far from the median (the second quartile), they still fall between the first and the third quartiles.
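As a numerical counterpart to this visual check, a minimal sketch of the usual 1.5 x IQR fences per season (using the dplyr package already loaded) is given below; observations outside these fences would be the usual outlier candidates.

# Numerical counterpart to the visual check (sketch only): 1.5*IQR fences per season.
# Observations outside [Lower, Upper] would be the usual outlier candidates.
DFElantra %>%
  group_by(Season) %>%
  summarise(Lower = quantile(ElantraSales, 0.25) - 1.5 * IQR(ElantraSales),
            Upper = quantile(ElantraSales, 0.75) + 1.5 * IQR(ElantraSales),
            Outliers = sum(ElantraSales < Lower | ElantraSales > Upper))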

Turning to the correlation between the variables: since there is a clear correlation with time, let's create a time variable that is numerical in nature and can be used in the correlation analysis.

# Create a Unix Time variable, a time reference described as numeric that can be used for correlation analysis.
DFElantra$Time = as.numeric(DFElantra$RefDate)

# For the sake of later presentation and easier computation, especially when testing different models, let's reorder the DFElantra dataset into a time sequence.
DFElantra = DFElantra[order(DFElantra$RefDate),]

We may now proceed with a correlation analysis of our numerical variables, as presented in Figure 15.

Correlation = cor(DFElantra[ , !(names(DFElantra) %in% c("RefDate", "Month", "Season", "Year"))])
corrplot.mixed(Correlation, title="Figure 14: Correlation of the numererical variables",mar=c(0,0,1,0))

rm(Correlation)

Looking at the correlations, one may realize that although "ElantraSales" is correlated with "Time", Time itself is highly correlated with "CPI_all" and "Unemployment", so that introducing these variables together is probably unnecessary and may compromise our model.

Although some may argue that the passage of time may reflect a network effect that increases sales, or the product's natural evolution along a bathtub curve, it is important to notice that the Elantra is a consolidated product and there is no causality behind the correlation between time and sales. As such, we will drop this variable from our models.

Something similar happens with the high correlation between "CPI_energy" and "CPI_all" and between "CPI_all" and "Unemployment". Such combinations can be kept out of our test models by simply dropping the CPI_all variable.
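To quantify this multicollinearity, one could compute variance inflation factors, as in the sketch below; this assumes the car package, which is not loaded earlier in this notebook.

# Sketch only (assumes the car package, not loaded above): variance inflation factors
# for the collinear candidates; values well above 5-10 indicate problematic collinearity.
library(car)
vif(lm(ElantraSales ~ Unemployment + Queries + CPI_energy + CPI_all + Time, data = DFElantra))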

Let us now look at the scatter plots in Figures 16 to 19 for the relation between the dependent variable "ElantraSales" and each of the independent variables.

plot09 = ggplot(DFElantra, aes(x=Unemployment, y=ElantraSales)) + 
  geom_point() +
  geom_boxplot(aes(group = cut_width(Unemployment, 0.6))) +
  stat_summary(fun.y = "median", geom = "point", size = 3) +
  geom_smooth(method='lm') +
  ggtitle("Figure 15: ElantraSales relation to Unemployment")

plot10 = ggplot(DFElantra, aes(x=Queries, y=ElantraSales)) + 
  geom_point() +
  geom_boxplot(aes(group = cut_width(Queries, 60))) +
  stat_summary(fun.y = "median", geom = "point", size = 3) +
  geom_smooth(method='lm') +
  ggtitle("Figure 16: ElantraSales relation to Queries")

plot11 = ggplot(DFElantra, aes(x=CPI_energy, y=ElantraSales)) + 
  geom_point() +
  geom_boxplot(aes(group = cut_width(CPI_energy, 8))) +
  stat_summary(fun.y = "median", geom = "point", size = 3) +
  geom_smooth(method='lm') +
  ggtitle("Figure 17: ElantraSales relation to CPI_energy")

plot12 = ggplot(DFElantra, aes(x=CPI_all, y=ElantraSales)) + 
  geom_point() +
  geom_boxplot(aes(group = cut_width(CPI_all, 3))) +
  stat_summary(fun.y = "median", geom = "point", size = 3) +
  geom_smooth(method='lm') +
  ggtitle("Figure 18: ElantraSales relation to CPI_all")

plot09; plot10; plot11; plot12

From the graphs, one can see that the data is quite spread along both axes, as previously observed in the histograms, which makes it almost impossible to apply any transformation to normalize the data. This limitation does not hinder the use of a linear model approximation, although it may compromise the use of R-squared as a reliable measure of model quality since, in such a case, the mean value is no longer a reliable baseline estimate.

In each of the graphs we included boxplots for about 6 clusters defined along the range of variation of the independent variable. These boxplots allow us to perceive how variables such as "Queries" are less clearly related to "ElantraSales", since the points spread farther, outside the first and third quartiles. Since we have such a small dataset, this result is not used to suppress any outliers.

Each of the graphs visually presents a univariate linear fit; let's now proceed to create a multivariable linear model.

Step 4. Modelling

Once the data is prepared, you can begin modelling. This is usually an iterative process where you run a model, evaluate the results, tweak your approach, run another model, evaluate the results, re-tweak and so on. You go on doing this until you come up with a model you are satisfied with or what you feel is the best possible result with the given data.

In this step we are going to create a few models to predict sales. To evaluate these models, the dataset will be split into two parts, using the last year of data, from 2013-03-01 to 2014-02-01, as the test dataset and the remaining data as the training dataset.

Effectively, as discussed in the introduction, we are setting our "present" at the end of February 2013 and trying to predict the following 12 months.

The best models will be the ones that produce the smallest error on the test dataset and, additionally, provide a good balance in terms of complexity, consistency with the correlation matrix, statistical significance of the variables, and so on.

An important condition for the present evaluation is that we limit ourselves to linear models, given the scope of the present course programme and task. We also avoid including highly correlated variables together in the test models, as previously discussed in the correlation analysis.

At first, let us create our training and test datasets.

DFElantraTrain = subset(DFElantra, RefDate < as.Date(strptime('25/02/2013', "%d/%m/%Y")))
DFElantraTrain$Season = relevel(DFElantraTrain$Season, "High")

FootNote = sprintf("Number of observations: %d",nrow(DFElantraTrain)) 

kable(data.frame(Variable = names(DFElantraTrain),
                 Classe = sapply(DFElantraTrain, typeof),
                 First_Values = sapply(DFElantraTrain, function(x) paste0(head(x),  collapse = ", ")),
                 row.names = NULL),
      format = "html",
      caption = "Table 3: STR for Elantra Training Dataset",
      table.attr = "style = \"color: black; font-size: 11pt;\"") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  footnote(general = FootNote, general_title = "")
Table 3: STR for Elantra Training Dataset
Variable Classe First_Values
Month integer 1, 2, 3, 4, 5, 6
ElantraSales integer 7690, 7966, 8225, 9657, 9781, 14245
Unemployment double 9.7, 9.8, 9.9, 9.9, 9.6, 9.4
Queries integer 153, 130, 138, 132, 177, 138
CPI_energy double 213.377, 209.924, 209.163, 209.024, 206.172, 204.161
CPI_all double 217.466, 217.251, 217.305, 217.376, 217.299, 217.285
RefDate double 2010-01-01, 2010-02-01, 2010-03-01, 2010-04-01, 2010-05-01, 2010-06-01
Season integer Low, Low, High, High, High, High
Time double 14610, 14641, 14669, 14700, 14730, 14761
Number of observations: 38
DFElantraTest = subset(DFElantra, RefDate > as.Date(strptime('25/02/2013', "%d/%m/%Y")))

FootNote = sprintf("Number of observations: %d",nrow(DFElantraTest)) 

kable(data.frame(Variable = names(DFElantraTest),
           Classe = sapply(DFElantraTest, typeof),
           First_Values = sapply(DFElantraTest, function(x) paste0(head(x),  collapse = ", ")),
           row.names = NULL),
      format = "html",
      caption = "Table 4: STR for Elantra Test Dataset",
      table.attr = "style = \"color: black; font-size: 11pt;\"") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  footnote(general = FootNote, general_title = "")

rm(FootNote)
Table 4: STR for Elantra Test Dataset
Variable Classe First_Values
Month integer 3, 4, 5, 6, 7, 8
ElantraSales integer 26153, 24445, 25090, 22163, 23958, 24700
Unemployment double 7.5, 7.5, 7.5, 7.5, 7.3, 7.2
Queries integer 313, 248, 252, 320, 274, 271
CPI_energy double 244.598, 238.86, 240.972, 245.412, 245.926, 244.917
CPI_all double 232.075, 231.707, 232.124, 232.86, 233.252, 233.433
RefDate double 2013-03-01, 2013-04-01, 2013-05-01, 2013-06-01, 2013-07-01, 2013-08-01
Season integer High, High, High, High, High, High
Time double 15765, 15796, 15826, 15857, 15887, 15918
Number of observations: 12

As a first model, let us include all original numerical variables, thus not considering the time (month/year) information, and apply stepwise optimization to discard unnecessary variables. The CPI_all variable was also left out due to its high correlation with CPI_energy and the fact that CPI_energy has a lower correlation with the Unemployment variable.

Please notice that the detailed results are hidden in the notebook and will be presented in the conclusion. This was done to reduce the output size and improve readability.

# Create base model
Model = lm(ElantraSales ~ Unemployment + Queries + CPI_energy, data=DFElantraTrain)

# Apply stepwise optimization
Model = step(Model)

# Compute prediction with the test dataset
Prediction = predict(Model,newdata = DFElantraTest) 

# Create dataframe to store errors
DFError = as.data.frame(DFElantraTest$ElantraSales-Prediction)
names(DFError)[1] = "NoTime"

# Create dataframe to store r^2
SSE = sum((DFError$NoTime)^2)
SST = sum((DFElantraTest$ElantraSales-mean(DFElantraTrain$ElantraSales))^2)

RSQUARE = as.data.frame(1 - (SSE/SST))
names(RSQUARE)[1] = "NoTime"

# Create dataframe to store RMSE
DFRMSE = as.data.frame(sqrt(sum((DFError$NoTime)^2)/length(Prediction)))
names(DFRMSE)[1] = "NoTime"

# store model for later analysis, if necessary
ModelList = list(Model)
names(ModelList)[length(ModelList)]="NoTime"

summary(Model)

Let's now create a second model using Unemployment, CPI_energy and Month. Since the most frequent months are the first two (5 occurrences each), we will keep the default reference level for the Month factor.

It's important to highlight that we also drop the "Queries" variable from this point on. It was already removed by the stepwise optimization, and a few tests demonstrated that this variable has low significance and does not contribute to the model's efficacy (see the quick check below).
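As an illustration of that low significance, the coefficient of Queries can be inspected directly in a model that keeps it; this is only a quick check, not one of the seven models compared later.

# Quick check (illustration only): coefficient and p-value of Queries when it is kept in the model.
QueriesCheck = lm(ElantraSales ~ Unemployment + CPI_energy + Queries + Month, data = DFElantraTrain)
summary(QueriesCheck)$coefficients["Queries", ]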

At this point, we also create a function to test the model and store the results in lists and data frames for later presentation and analysis; this reduces the code length for each model.

Model = lm(ElantraSales ~ Unemployment + CPI_energy + Month, data=DFElantraTrain)

# create a function that tests the model currently stored in the global variable 'Model' and stores the results
test.model <- function(ModelLabel) { 
  # Compute prediction with the test dataset
  Prediction = predict(Model,newdata = DFElantraTest) 

  # Store errors in the appropriate dataframe
  PredError = DFElantraTest$ElantraSales-Prediction
  DFError[,ModelLabel] <<- as.data.frame(PredError)
  
  # Store r^2 in the appropriate dataframe
  SSE = sum((PredError)^2)

  RSQUARE[,ModelLabel] <<- as.data.frame(1 - (SSE/SST))

  # Store RMSE in the appropriate dataframe
  DFRMSE[,ModelLabel] <<- as.data.frame(sqrt(sum((PredError)^2)/length(Prediction)))
  
  ModelList[[ModelLabel]] <<- Model

  return(summary(Model))  
}

test.model("UnEnergyMonth")

Let's create a third model, similar to the previous one but removing the CPI_energy variable. Although not displayed, the previous model presented low significance for the Unemployment variable, something that is not expected since this variable is highly correlated with the sales volume.

This unexpected result might be a consequence of the correlation between the independent variables, which justifies this new model.

Model = lm(ElantraSales ~ Unemployment + Month, data=DFElantraTrain)

test.model("UnMonth")

Let's create a fourth model using only the CPI_energy and Month variables, thus completing this set.

Model = lm(ElantraSales ~ CPI_energy + Month, data=DFElantraTrain)

test.model("EnergyMonth")

We may now experiment with three other models, analogous to the second, third and fourth, but replacing the Month variable with the sales season as previously discussed. For the categorical "Season" variable, we shall use the value "High" as the reference level since it is the one with the most observations (6 out of 12 months).

DFElantraTrain$Season = relevel(DFElantraTrain$Season, "High")
DFElantraTest$Season = relevel(DFElantraTest$Season, "High")

Model = lm(ElantraSales ~ Unemployment + CPI_energy + Season, data=DFElantraTrain)

test.model("UnEnergySeason")
Model = lm(ElantraSales ~ Unemployment + Season, data=DFElantraTrain)

test.model("UnSeason")
Model = lm(ElantraSales ~ CPI_energy + Season, data=DFElantraTrain)

test.model("EnergySeason")

We may visually inspect the error vectors using a boxplot, as presented in Figure 20. The model name encodes the variables used, e.g. NoTime uses all variables except the ones related to time; UnEnergyMonth uses the Unemployment, CPI_energy and Month variables; UnSeason uses the Unemployment and Season variables.

Tables 5 and 6 present the R-squared and RMSE values for the various models, allowing a more objective comparison.

MeltedError = melt(data = DFError, measure.vars = c("NoTime","UnEnergyMonth","UnMonth","EnergyMonth","UnEnergySeason","UnSeason","EnergySeason"))
names(MeltedError)[1] = 'Model'
names(MeltedError)[2] = 'Error'

plot13 = ggplot(MeltedError, aes(x=Model, y=Error)) +
  geom_boxplot() +
  geom_point() +
  guides(fill = FALSE) +
  stat_summary(fun.y = mean, geom = "point", shape = 5, size = 4) +
  ggtitle("Figure 19: Error of the different models for the 12 points on the test dataset")

plot13

PlotTable = transpose(RSQUARE)
rownames(PlotTable) = colnames(RSQUARE)
colnames(PlotTable) = c("r.square")

kable(PlotTable,
      format = "html",
      format.args = list(digits = 2, nsmall = 2),
      caption = "Table 5: r.square values for the various models over the test dataset",
      table.attr = "style = \"color: black; font-size: 11pt;\"") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

PlotTable = transpose(DFRMSE)
rownames(PlotTable) = colnames(DFRMSE)
colnames(PlotTable) = c("RMSE")

kable(PlotTable,
      format = "html",
      format.args = list(digits = 2, nsmall = 2),
      caption = "Table 6: RMSE values for various models over the test dataset",
      table.attr = "style = \"color: black; font-size: 11pt;\"") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

# rm(DFElantraTest, DFElantraTrain, DFError, DFRMSE, Model, ModelList, MeltedError, PlotTable, RSQUARE, SSE, SST, Prediction)
Table 5: r.square values for the various models over the test dataset
r.square
NoTime 0.47
UnEnergyMonth 0.78
UnMonth 0.88
EnergyMonth 0.51
UnEnergySeason 0.81
UnSeason 0.89
EnergySeason 0.53
Table 6: RMSE values for various models over the test dataset
RMSE
NoTime 5558.39
UnEnergyMonth 3586.76
UnMonth 2601.53
EnergyMonth 5315.38
UnEnergySeason 3344.59
UnSeason 2510.07
EnergySeason 5208.79

From the above results we conclude that the best models (smallest error in the predicted values) are the ones that combine the month or season information with the Unemployment variable. All variables included in the selected models are significant, and these models also present beta coefficients in accordance with the expected signs, considering the positive or negative correlation of the variables with the Elantra sales.

It is also interesting to notice that the model using the energy CPI and the season presented very low dispersion around its median error, although the error itself is quite high. This might be an indication that this model could provide better results if another variable, less correlated with the existing ones, were included.

For these reasons, let's proceed with our evaluation using the models based on the Unemployment variable combined with Season and with Month.

Step 5. Validation

The final model (or maybe the best 2-3 models) should then be put through the validation process. In this process, you test the model using completely new data set i.e. data that was not used to build the model. This process ensures that your model is a good model in general and not just a very good model for the specific data earlier used (Technically, this is called avoiding over fitting).

Since neither model depends directly on the time sequence, but only on the values of the variables in each month, we may split the dataset into smaller blocks and perform a series of training and testing runs. This gives us more train/test trials and lets us decide on a specific model with more reliability. This process is known as cross-validation.

Following this idea, the code below takes blocks of 5 months as the test dataset and uses the remaining months to train the models, repeating this 10 times with different blocks within the 50 months that comprise the complete dataset.

Since the data was ordered in time, each test set corresponds to five consecutive months, so the first test set contains data from January 2010 to May 2010 and the last test set contains data from October 2013 to February 2014.

Figures 21 to 24 compare the errors on each test set using boxplots, first as absolute values and then as a percentage of the actual value. This data is summarised in Table 7.

Then, in Figure 25 and Table 8, the RMSE values for both models are compared.

DFSize = nrow(DFElantra)
SampleSize = 5
TestID = 1
for(i in seq(from=1, to=DFSize, by=SampleSize)) {
  # create string with name to the test
  TestSetName = sprintf('TestSet%d',TestID)
  
  # create test set
  DFElantraTest = DFElantra[seq(from=i, to=i+SampleSize-1),]
  
  # create training set with the remaining data handling the first and last cases in specific ways
  if (i==1) {
    DFElantraTrain = tail(DFElantra, DFSize - SampleSize)

    #initialize dataframes
    DFMonthError = as.data.frame(seq(from=1, to=SampleSize))
    names(DFMonthError)[1] = TestSetName
    DFSeasonError = DFMonthError
    DFMonthErrorP = DFMonthError
    DFSeasonErrorP = DFMonthError
    DFRMSE = as.data.frame(0)
    names(DFRMSE)[1] = "Month"
  } else if (i == (DFSize - SampleSize + 1)) {   # last block: train on all preceding months
    DFElantraTrain = head(DFElantra, DFSize - SampleSize)
  } else {
    DFElantraTrain = rbind(DFElantra[seq(from=1, to=i-1),],DFElantra[seq(from=i+SampleSize, to=DFSize),])
  }
  DFElantraTrain$Season = relevel(DFElantraTrain$Season, "High")
  
  # train model
  MonthModel = lm(ElantraSales ~ Unemployment + Month, data=DFElantraTrain)
  SeasonModel = lm(ElantraSales ~ Unemployment + Season, data=DFElantraTrain)
  
  # test model
  MonthModelPrediction = predict(MonthModel,newdata = DFElantraTest) 
  SeasonModelPrediction = predict(SeasonModel,newdata = DFElantraTest)

  # compute error
  DFMonthError[,TestSetName] = DFElantraTest$ElantraSales-MonthModelPrediction
  DFSeasonError[,TestSetName] = DFElantraTest$ElantraSales-SeasonModelPrediction
  
  # compute percentual error
  DFMonthErrorP[,TestSetName] = DFMonthError[,TestSetName] / DFElantraTest$ElantraSales
  DFSeasonErrorP[,TestSetName] = DFSeasonError[,TestSetName] / DFElantraTest$ElantraSales

    # compute RMSE
  DFRMSE[TestID,'Month'] = sqrt(sum((DFMonthError[,TestSetName])^2)/length(MonthModelPrediction)) 
  DFRMSE[TestID,'Season'] = sqrt(sum((DFSeasonError[,TestSetName])^2)/length(SeasonModelPrediction)) 
  
  TestID = TestID + 1
}

MonthMeltedError = melt(data = DFMonthError, measure.vars = c("TestSet1", "TestSet2", "TestSet3", "TestSet4", "TestSet5", "TestSet6", "TestSet7", "TestSet8", "TestSet9", "TestSet10"))
names(MonthMeltedError)[1] = 'Month'
names(MonthMeltedError)[2] = 'Error'

plot14 = ggplot(MonthMeltedError, aes(x=Month, y=Error)) +
  geom_boxplot() +
  geom_point() +
  guides(fill = FALSE) +
  stat_summary(fun.y = mean, geom = "point", shape = 5, size = 4) +
  ggtitle("Figure 20: Error on each test set for model using the month variable")

SeasonMeltedError = melt(data = DFSeasonError, measure.vars = c("TestSet1", "TestSet2", "TestSet3", "TestSet4", "TestSet5", "TestSet6", "TestSet7", "TestSet8", "TestSet9", "TestSet10"))
names(SeasonMeltedError)[1] = 'Season'
names(SeasonMeltedError)[2] = 'Error'

plot15 = ggplot(SeasonMeltedError, aes(x=Season, y=Error)) +
  geom_boxplot() +
  geom_point() +
  guides(fill = FALSE) +
  stat_summary(fun.y = mean, geom = "point", shape = 5, size = 4) +
  ggtitle("Figure 21: Error on each test set for model using the season variable")

MonthMeltedErrorP = melt(data = DFMonthErrorP, measure.vars = c("TestSet1", "TestSet2", "TestSet3", "TestSet4", "TestSet5", "TestSet6", "TestSet7", "TestSet8", "TestSet9", "TestSet10"))
names(MonthMeltedErrorP)[1] = 'Month'
names(MonthMeltedErrorP)[2] = 'Percentual.Error'

plot16 = ggplot(MonthMeltedErrorP, aes(x=Month, y=Percentual.Error)) +
  geom_boxplot() +
  geom_point() +
  guides(fill = FALSE) +
  stat_summary(fun.y = mean, geom = "point", shape = 5, size = 4) +
  ggtitle("Figure 22: Percentual error on each test set for model using the month variable")

SeasonMeltedErrorP = melt(data = DFSeasonErrorP, measure.vars = c("TestSet1", "TestSet2", "TestSet3", "TestSet4", "TestSet5", "TestSet6", "TestSet7", "TestSet8", "TestSet9", "TestSet10"))
names(SeasonMeltedErrorP)[1] = 'Season'
names(SeasonMeltedErrorP)[2] = 'Percentual.Error'

plot17 = ggplot(SeasonMeltedErrorP, aes(x=Season, y=Percentual.Error)) +
  geom_boxplot() +
  geom_point() +
  guides(fill = FALSE) +
  stat_summary(fun.y = mean, geom = "point", shape = 5, size = 4) +
  ggtitle("Figure 23: Percentual error on each test set for model using the season variable")

Month.Percentual.Error=summary(100*MonthMeltedErrorP$Percentual.Error)
Season.Percentual.Error=summary(100*SeasonMeltedErrorP$Percentual.Error)

kable(rbind(Month.Percentual.Error,Season.Percentual.Error),
      format = "html",
      format.args = list(digits = 1, nsmall = 1),
      caption = "Table 7: Percentual Errors for both models",
      table.attr = "style = \"color: black; font-size: 11pt;\"") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

plot14; plot15; plot16; plot17

MeltedError = melt(data = DFRMSE, measure.vars = c("Month", "Season"))
names(MeltedError)[1] = 'Model'
names(MeltedError)[2] = 'RMSE'

plot18 = ggplot(MeltedError, aes(x=Model, y=RMSE)) +
  geom_boxplot() +
  geom_point() +
  guides(fill = FALSE) +
  stat_summary(fun.y = mean, geom = "point", shape = 5, size = 4) +
  ggtitle("Figure 24: RMSE for each model over for the 10 test sets")

#datatable(DFRMSE[order(DFRMSE$Season),] , caption = 'Table 7: RMSE for each test for both models')

kable(DFRMSE[order(DFRMSE$Season),],
      format = "html",
      format.args = list(digits = 2, nsmall = 2),
      caption = "Table 8: RMSE for each test for both models",
      table.attr = "style = \"color: black; font-size: 11pt;\"") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

plot18 

# rm(MeltedError, DFError,DFElantraTest,DFElantraTrain, DFRMSE, ComplexModelPrediction, ComplexModel, SimpleModelPrediction, SimpleModel)
Table 7: Percentual Errors for both models
Min. 1st Qu. Median Mean 3rd Qu. Max.
Month.Percentual.Error -102.5 -12.6 0.7 -3.4 12.5 32.8
Season.Percentual.Error -84.2 -12.1 1.0 -3.2 10.5 27.7
Table 8: RMSE for each test for both models
Month Season
7 1417.73 1418.22
9 2275.26 1787.33
5 1879.25 1815.91
2 2181.23 1822.08
3 2071.83 1941.72
6 2390.88 2000.79
8 2880.36 2637.09
10 2785.18 2835.01
4 4503.32 3742.04
1 5842.57 4861.04

Considering the 10 tests performed, we conclude that the model using the created "Season" variable is the more effective and efficient of the two and should be used for predicting Elantra sales.

Step 6. Implementation and tracking

The final model is chosen after the validation. Then you start implementing the model and tracking the results. You need to track results to see the performance of the model over time. In general, the accuracy of a model goes down over time. How much time will really depend on the variables - how dynamic or static they are, and the general environment - how static or dynamic that is.

This step is beyond the scope of the present work.

Conclusion

Based on the supplied dataset, we analysed the available variables and proposed a model to predict the monthly sales of the Hyundai Elantra.

The proposed model uses a new variable related to the idea of a "sales season". This categorical variable can assume three values, High, Mid and Low, and was created to distinguish between times of the year when there are systematically more sales, such as around the release of a new model, and times when there are fewer sales.

The model predicts that sales are negatively related to the unemployment rate and are affected by sales seasons created by externalities, such as the release of a new car design. The model is simple to apply and understand.

In the tests performed, the proposed model produced errors that varied from about -7000 to about +4300 units, with an RMSE of about 4800 units, which corresponds to about 30% of the median sales value. This percentage is exceedingly high, and further studies are needed to incorporate other variables, less correlated with the existing ones, that might improve these estimates.

One may also evaluate more complex models that treat the data as a time series and exploit the idea that near-future sales are strongly related to near-past sales, something the purely cross-sectional models explored here do not capture.
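As a possible direction only, a minimal sketch of such a model, augmenting the linear specification with the previous month's sales (not evaluated in this work), could look like the following.

# Possible follow-up (sketch only, not evaluated in this work): augment the linear model
# with the previous month's sales, using the time-ordered dataset.
DFLagged = DFElantra[order(DFElantra$RefDate), ]
DFLagged$PrevSales = c(NA, head(DFLagged$ElantraSales, -1))  # sales of the previous month
LagModel = lm(ElantraSales ~ Unemployment + Season + PrevSales, data = DFLagged)
summary(LagModel)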