Lab 2 - An analytical detective

Student

Introduction

Crime is an international concern, but it is documented and handled in very different ways in different countries. In the United States, violent crimes and property crimes are recorded by the Federal Bureau of Investigation (FBI). Additionally, each city documents crime, and some cities release data regarding crime rates. The city of Chicago, Illinois releases crime data from 2001 onward online. Chicago is the third most populous city in the United States, with a population of over 2.7 million people. The city of Chicago is shown in the map below, with the state of Illinois highlighted in red.

There are two main types of crimes: violent crimes, and property crimes. In this problem, we’ll focus on one specific type of property crime, called “motor vehicle theft” (sometimes referred to as grand theft auto). This is the act of stealing, or attempting to steal, a car. In this problem, we’ll use some basic data analysis in R to understand the motor vehicle thefts in Chicago.

Please download the file mvtWeek1.csv for this problem (do not open this file in any spreadsheet software before completing this problem because it might change the format of the Date field). Here is a list of descriptions of the variables:

Problem 1 - Loading the Data

Read the dataset mvtWeek1.csv into R, using the read.csv function, and call the data frame “mvt”. Remember to navigate to the directory on your computer containing the file mvtWeek1.csv first. It may take a few minutes to read in the data, since it is pretty large. Then, use the str and summary functions to answer the following questions.

mvt <- read.csv("./mvtWeek1.csv")
summary(mvt)
##        ID                      Date       
##  Min.   :1310022   5/16/08 0:00  :    11  
##  1st Qu.:2832144   10/17/01 22:00:    10  
##  Median :4762956   4/13/04 21:00 :    10  
##  Mean   :4968629   9/17/05 22:00 :    10  
##  3rd Qu.:7201878   10/12/01 22:00:     9  
##  Max.   :9181151   10/13/01 22:00:     9  
##                    (Other)       :191582  
##                      LocationDescription   Arrest         Domestic      
##  STREET                        :156564   Mode :logical   Mode :logical  
##  PARKING LOT/GARAGE(NON.RESID.): 14852   FALSE:176105    FALSE:191226   
##  OTHER                         :  4573   TRUE :15536     TRUE :415      
##  ALLEY                         :  2308   NA's :0         NA's :0        
##  GAS STATION                   :  2111                                  
##  DRIVEWAY - RESIDENTIAL        :  1675                                  
##  (Other)                       :  9558                                  
##       Beat         District     CommunityArea        Year     
##  Min.   : 111   Min.   : 1.00   Min.   : 0      Min.   :2001  
##  1st Qu.: 722   1st Qu.: 6.00   1st Qu.:22      1st Qu.:2003  
##  Median :1121   Median :10.00   Median :32      Median :2006  
##  Mean   :1259   Mean   :11.82   Mean   :38      Mean   :2006  
##  3rd Qu.:1733   3rd Qu.:17.00   3rd Qu.:60      3rd Qu.:2009  
##  Max.   :2535   Max.   :31.00   Max.   :77      Max.   :2012  
##                 NA's   :43056   NA's   :24616                 
##     Latitude       Longitude     
##  Min.   :41.64   Min.   :-87.93  
##  1st Qu.:41.77   1st Qu.:-87.72  
##  Median :41.85   Median :-87.68  
##  Mean   :41.84   Mean   :-87.68  
##  3rd Qu.:41.92   3rd Qu.:-87.64  
##  Max.   :42.02   Max.   :-87.52  
##  NA's   :2276    NA's   :2276
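
The instructions also mention str, which complements summary by listing each column's type and first few values. A minimal sketch on a small made-up data frame follows; the `toy` rows are invented for illustration and are not taken from mvtWeek1.csv.

```r
# Toy stand-in for mvt; the values are invented for illustration only.
toy <- data.frame(
  ID = c(101L, 102L, 103L),
  Date = c("12/31/12 23:15", "12/31/12 22:00", "12/6/12 19:30"),
  Arrest = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

str(toy)                          # one line per column: name, type, first values
print(dim(toy))                   # number of rows, then number of columns
col_types <- sapply(toy, class)   # named character vector of column classes
print(col_types)
```

On the real data, `str(mvt)` should report the same row and column counts that nrow and ncol return in the questions below.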

1.1 - How many rows

How many rows of data (observations) are in this dataset?

answer = nrow(mvt)
cat(sprintf("Number of rows = %d\n", answer))
## Number of rows = 191641

1.2 - How many variables

How many variables are in this dataset?

answer = ncol(mvt)
cat(sprintf("Number of variables = %d\n", answer))
## Number of variables = 11

1.3 - Maximum value

Using the “max” function, what is the maximum value of the variable “ID”?

answer = max(mvt$ID)
cat(sprintf("Maximum value of \"ID\" = %d\n", answer))
## Maximum value of "ID" = 9181151

1.4 - Minimum value

What is the minimum value of the variable “Beat”?

answer = min(mvt$Beat)
cat(sprintf("Minimum value of \"Beat\" = %d\n", answer))
## Minimum value of "Beat" = 111

1.5 - How many arrests

How many observations have value TRUE in the Arrest variable (this is the number of crimes for which an arrest was made)?

answer = nrow(mvt[mvt$Arrest == TRUE,])
cat(sprintf("Number of arrests = %d\n", answer))
## Number of arrests = 15536

1.6 - How many observations in ALLEY

How many observations have a LocationDescription value of ALLEY?

answer = nrow(mvt[mvt$LocationDescription == 'ALLEY',])
cat(sprintf("Number of LocationDescription with \"ALLEY\" = %d\n", answer))
## Number of LocationDescription with "ALLEY" = 2308
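
The counts in 1.5 and 1.6 can also be obtained without subsetting the data frame, because R treats TRUE as 1 when summing a logical vector. The sketch below uses invented values; `arrest` and `location` are hypothetical stand-ins for mvt$Arrest and mvt$LocationDescription.

```r
# Invented values standing in for two columns of mvt.
arrest <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
location <- c("STREET", "ALLEY", "STREET", "ALLEY", "GAS STATION")

n_arrests <- sum(arrest)               # each TRUE counts as 1
n_alley <- sum(location == "ALLEY")    # count the matches of a condition
location_counts <- table(location)     # counts for every value at once

print(n_arrests)
print(n_alley)
print(location_counts)
```

One practical difference: if a column contained NA values, the subsetting form `nrow(df[condition, ])` would count the NA rows as well, whereas `sum(condition, na.rm = TRUE)` skips them. The summary above shows no NA values in Arrest, so both forms agree here.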

Problem 2 - Understanding Dates in R

In many datasets, like this one, you have a date field. Unfortunately, R does not automatically recognize entries that look like dates. We need to use a function in R to extract the date and time. Take a look at the first entry of Date (remember to use square brackets when looking at a certain entry of a variable).

2.1 - What is the date format

In what format are the entries in the variable Date?

# display a few elements spread across the dataset
cat("Data is of class: ",class(mvt$Date),"\n\n",
    "Element      1:",as.character(mvt[1,"Date"]),"\n",
    "Element      2:",as.character(mvt[2,"Date"]),"\n",
    "Element      5:",as.character(mvt[5,"Date"]),"\n",
    "Element    100:",as.character(mvt[100,"Date"]),"\n",
    "Element   1000: ",as.character(mvt[1000,"Date"]),"\n",
    "Element 10 000: ",as.character(mvt[10000,"Date"]),"\n",
    "Element 20 000:  ",as.character(mvt[20000,"Date"]),"\n",
    "Element 30 000:",as.character(mvt[30000,"Date"]),"\n")
## Data is of class:  factor 
## 
##  Element      1: 12/31/12 23:15 
##  Element      2: 12/31/12 22:00 
##  Element      5: 12/31/12 21:30 
##  Element    100: 12/29/12 14:00 
##  Element   1000:  12/6/12 19:30 
##  Element 10 000:  4/17/12 12:00 
##  Element 20 000:   8/4/11 22:30 
##  Element 30 000: 12/26/10 20:00

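Answer: The sampled entries run in descending chronological order, from 12/31/12 down to 12/26/10. In the sample, the first field never exceeds 12 while the second reaches 31, so each entry is month/day/two-digit year followed by hour:minute. The variable Date is therefore in the format “%m/%d/%y %H:%M”.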
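
One way to check a guessed format is to parse a few of the sampled strings and see whether any come back as NA. This is a sketch under the assumption that times can be read as UTC; the `samples` vector copies three entries printed above, and the day-first format is included only as a deliberately wrong guess.

```r
# Three of the entries printed above.
samples <- c("12/31/12 23:15", "12/6/12 19:30", "8/4/11 22:30")

# Month-first guess: every entry should parse.
parsed <- strptime(samples, format = "%m/%d/%y %H:%M", tz = "UTC")
iso <- format(parsed, "%Y-%m-%d %H:%M")
print(iso)

# Day-first guess: "12/31/12" would need month 31, so it cannot parse.
wrong <- strptime(samples, format = "%d/%m/%y %H:%M", tz = "UTC")
print(is.na(wrong))
```

Note that the day-first guess still parses the other two strings, so a single NA among many rows is the signal to look for; checking `sum(is.na(...))` on the full column is more reliable than inspecting a handful of values.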

2.2 - Median date

Now, let’s convert these characters into a Date object in R. In your R console, type

# set the time locale so month and weekday names print in English (this locale name is Windows-specific)
answer = Sys.setlocale("LC_TIME","English_United States.1252")

DateConvert = as.Date(strptime(mvt$Date, "%m/%d/%y %H:%M"))

This converts the variable “Date” into a Date object in R. Take a look at the variable DateConvert using the summary function.

What is the month and year of the median date in our dataset?

answer = median(DateConvert)
cat(sprintf("Median date in the dataset = %s\n", format(answer, format="%B %Y")))
## Median date in the dataset = May 2006
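
The conversion and the median can be tried on a few strings before running them on the whole column. The sketch below reuses three timestamps from the sample in 2.1 and assumes UTC; `stamps`, `dates`, and `mid` are names chosen for this example only.

```r
# Three timestamps from the sample shown in 2.1.
stamps <- c("12/31/12 23:15", "4/17/12 12:00", "12/26/10 20:00")

# strptime parses date and time; as.Date keeps only the calendar date.
dates <- as.Date(strptime(stamps, "%m/%d/%y %H:%M", tz = "UTC"))
print(class(dates))

mid <- median(dates)              # the median of a Date vector is a Date
print(format(mid, "%Y-%m"))       # numeric year-month, independent of locale
print(summary(dates))             # Min, quartiles, Median, Mean, Max as dates
```

Formatting with "%Y-%m" avoids depending on the locale, whereas "%B" prints the month name in whatever language the session's LC_TIME setting specifies.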

2.3 - Month with fewest thefts

Now, let’s extract the month and the day of the week, and add these variables to our data frame mvt. We can do this with two simple functions. Type the following commands in R:

mvt$Month = months(DateConvert)

mvt$Weekday = weekdays(DateConvert)

This creates two new variables in our data frame, Month and Weekday, and sets them equal to the month and weekday values that we can extract from the Date object. Lastly, replace the old Date variable with DateConvert by typing:

mvt$Date = DateConvert

# delete workspace variables that are not needed for later processing
rm(DateConvert)

Using the table command, answer the following questions.

In which month did the fewest motor vehicle thefts occur?

MonthCount = table(mvt$Month)

answer = names(MonthCount[MonthCount==min(MonthCount)])

cat(sprintf("Month with fewest thefts = %s\n", answer))
## Month with fewest thefts = February
# delete workspace variables that are not needed for later processing
rm(MonthCount)
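
The month extraction and count in 2.3 can be tried on a handful of invented dates. This sketch sets the time locale to "C" so that months() returns English names; the `dates` vector is made up for illustration. It uses which.min as an alternative to comparing against min(); unlike the comparison, which.min returns only the first minimum when there is a tie.

```r
# Assumption: the "C" time locale, so that month names are in English.
invisible(Sys.setlocale("LC_TIME", "C"))

# Invented dates: two in January, one in February, three in March.
dates <- as.Date(c("2012-01-05", "2012-01-20", "2012-02-11",
                   "2012-03-02", "2012-03-15", "2012-03-30"))

month_names <- months(dates)          # character vector of month names
month_count <- table(month_names)     # counts, sorted alphabetically by name
fewest <- names(which.min(month_count))

print(month_count)
print(fewest)
```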

2.4 - Weekday with most thefts

On which weekday did the most motor vehicle thefts occur?

WeekdayCount = table(mvt$Weekday)

answer = names(WeekdayCount[WeekdayCount==max(WeekdayCount)])

cat(sprintf("Weekday with most thefts = %s\n", answer))
## Weekday with most thefts = Friday
# delete workspace variables that are not needed for later processing
rm(WeekdayCount)
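
By default table lists weekday names alphabetically, which makes the counts harder to scan. A possible refinement, sketched here on invented dates, is to store the weekday as a factor with explicit levels so the table follows calendar order. The `day_levels` ordering (Monday first) is a choice made for this example, and the "C" time locale is assumed so that weekdays() returns English names.

```r
# Assumption: the "C" time locale, so that weekday names are in English.
invisible(Sys.setlocale("LC_TIME", "C"))

# Invented dates: two Fridays, one Saturday, one Monday.
dates <- as.Date(c("2012-12-28", "2012-12-29", "2012-12-31", "2013-01-04"))

day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
weekday <- factor(weekdays(dates), levels = day_levels)

weekday_count <- table(weekday)       # calendar order, zero counts included
busiest <- names(which.max(weekday_count))

print(weekday_count)
print(busiest)
```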

2.5 - Month with most thefts leading to an arrest

Each observation in the dataset represents a motor vehicle theft, and the Arrest variable indicates whether an arrest was later made for this theft. Which month has the largest number of motor vehicle thefts for which an arrest was made?

DFArrestTrue = mvt[mvt$Arrest==TRUE,]

DFArrestTrueCount = table(DFArrestTrue$Month)

answer = names(DFArrestTrueCount[DFArrestTrueCount==max(DFArrestTrueCount)])

cat(sprintf("Month with most thefts leading to an arrest = %s\n", answer))
## Month with most thefts leading to an arrest = January
# delete workspace variables that are not needed for later processing
rm(DFArrestTrue,DFArrestTrueCount)
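
An alternative to filtering first is a two-way table of month against arrest status, which gives the arrest counts and the totals in one object. The sketch below uses an invented data frame; `toy` and its values are hypothetical and chosen so that the month with the most thefts differs from the month with the most arrests.

```r
# Invented data: March has the most thefts, January the most arrests.
toy <- data.frame(
  Month = c("January", "January", "February", "March", "March", "March"),
  Arrest = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

by_month <- table(toy$Month, toy$Arrest)    # rows: month, columns: FALSE/TRUE
arrests <- by_month[, "TRUE"]               # thefts with an arrest, per month
top_month <- names(which.max(arrests))
arrest_rate <- arrests / rowSums(by_month)  # share of thefts with an arrest

print(by_month)
print(top_month)
print(round(arrest_rate, 2))
```

Comparing `arrests` with `arrest_rate` shows whether a month leads simply because it has more thefts overall or because a larger share of its thefts ended in an arrest.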