Crista Gregg and Halid Kopanski 7/2/2021
The following analysis breaks down bicycle sharing usage based on data gathered for every recorded Tuesday in the years 2011 and 2012. The data was collected from users of Capital Bikeshare, based in Washington, D.C. In total, the full dataset contains 731 entries (one per day across both years), and 16 variables were recorded for each entry. The following is the list of the 16 variables and a short description of each:
Variable | Description |
---|---|
instant | record index |
dteday | date |
season | season (winter, spring, summer, fall) |
yr | year (2011, 2012) |
mnth | month of the year |
holiday | whether the day is a holiday (1) or not (0) |
weekday | day of the week |
workingday | 1 if the day is neither a weekend nor a holiday, otherwise 0 |
weathersit | Description of weather conditions (see below) |
- | 1: Clear, Few clouds, Partly cloudy |
- | 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist |
- | 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds |
- | 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog |
temp | Normalized temperature in Celsius. |
atemp | Normalized perceived temperature in Celsius. |
hum | Normalized humidity. |
windspeed | Normalized wind speed. |
casual | count of casual users |
registered | count of registered users |
cnt | sum of both casual and registered users |
Sources | Raw data and more information can be found here |
In addition to summary statistics, this report will also model the number of bicycle users with linear regression, random forest, and boosting models. These models will help determine the anticipated number of users based on readily available data. The response variables are casual, registered, and cnt; the remaining variables, excluding the date (dteday) and instant columns, will be the predictors for the models developed later in this report.
Here, we set up the data for the selected day of the week, convert the categorical variables to factors, and then split the data into training and test sets.
set.seed(1) #get the same splits every time
bikes <- read_csv('day.csv')

day_function <- function(x){
  x <- x + 1
  switch(x, "Sunday",
         "Monday",
         "Tuesday",
         "Wednesday",
         "Thursday",
         "Friday",
         "Saturday")
}

season_function <- function(x){
  #x <- as.character(x)
  switch(x, "Spring",
         "Summer",
         "Fall",
         "Winter")
}

bikes <- bikes %>% select(everything()) %>%
  mutate(weekday = sapply(weekday, day_function),
         season = sapply(season, season_function))

bikes$season <- as.factor(bikes$season)
bikes$yr <- as.factor(bikes$yr)
levels(bikes$yr) <- c('2011', '2012')
bikes$mnth <- as.factor(bikes$mnth)
bikes$holiday <- as.factor(bikes$holiday)
bikes$weekday <- as.factor(bikes$weekday)
bikes$workingday <- as.factor(bikes$workingday)
bikes$weathersit <- as.factor(bikes$weathersit)
levels(bikes$weathersit) <- c('Clear to some clouds', 'Misty', 'Light snow or rain')

day <- params$day_of_week

#filter bikes by day of week
bikes <- filter(bikes, weekday == day)

#split data into train and test sets
train_rows <- sample(nrow(bikes), 0.7*nrow(bikes))
train <- bikes[train_rows, ] %>%
  select(-instant, -weekday, -casual, -registered, -dteday)
test <- bikes[-train_rows, ] %>%
  select(-instant, -weekday, -casual, -registered, -dteday)
Below are the summary statistics of the bike user counts: casual, registered, and total (cnt).
knitr::kable(summary(bikes[,14:16]))
casual | registered | cnt
---|---|---
Min. : 9.0 | Min. : 573 | Min. : 683
1st Qu.: 227.0 | 1st Qu.:3282 | 1st Qu.:3579
Median : 542.5 | Median :3943 | Median :4576
Mean : 556.2 | Mean :3954 | Mean :4511
3rd Qu.: 805.5 | 3rd Qu.:5104 | 3rd Qu.:5769
Max. :1348.0 | Max. :6697 | Max. :7767
The following table tells us the total number of rentals for each of the two years of collected data, as well as the average number of rentals per day.
bikes %>%
  group_by(yr) %>%
  summarise(total_rentals = sum(cnt), avg_rentals = round(mean(cnt))) %>%
  knitr::kable()
yr | total_rentals | avg_rentals |
---|---|---|
2011 | 180338 | 3468 |
2012 | 288771 | 5553 |
Now we will look at the number of days with each type of weather by season. The weathersit categories have been relabeled so that 1 is 'Clear to some clouds', 2 is 'Misty', and 3 is 'Light snow or rain'.
knitr::kable(table(bikes$season, bikes$weathersit))
Season | Clear to some clouds | Misty | Light snow or rain
---|---|---|---
Fall | 21 | 5 | 1 |
Spring | 16 | 9 | 0 |
Summer | 14 | 12 | 0 |
Winter | 11 | 12 | 3 |
The following box plot shows how many rentals we see on days that are sunny or partly cloudy, misty, or rainy/snowy. We may expect some differences in behavior between weekend days, when fewer people might be inclined to ride their bikes for pleasure, and weekdays, when more people might brave moderately unpleasant weather to get to work.
ggplot(bikes, aes(factor(weathersit), cnt)) +
geom_boxplot() +
labs(x = 'Type of Weather', y = 'Number of Rental Bikes', title = 'Rental Bikes by Type of Weather') +
theme_minimal()
weather_summary <- bikes %>%
  group_by(weathersit) %>%
  summarise(total_rentals = sum(cnt), avg_rentals = round(mean(cnt)))

weather_min <- switch(which.min(weather_summary$avg_rentals),
                      "clear weather",
                      "misty weather",
                      "weather with light snow or rain")
According to the box plot above, weather with light snow or rain brings out the fewest total users.
Below is a chart of the relationship between casual and registered bikers. We might expect a change in the slope if we look at different days of the week; perhaps we see more registered bikers riding on weekdays but more casual users on weekends.
ggplot(bikes, aes(casual, registered)) +
geom_point() +
geom_smooth(formula = 'y ~ x', method = 'lm') +
theme_minimal() +
labs(title = 'Registered versus Casual Renters')
Below we see a plot of the average daily number of bikers by month. We should expect to see more bikers in the spring and summer months, and the least in the winter.
plot_mth <- bikes %>%
  group_by(mnth) %>%
  summarize(avg_bikers = mean(cnt))
ggplot(plot_mth, aes(mnth, avg_bikers)) +
geom_line(group = 1, color = 'darkblue', size = 1.2) +
geom_point(size = 2) +
theme_minimal() +
labs(title='Average daily number of bikers by month', y = 'Average Daily Bikers', x = 'Month') +
scale_x_discrete(labels = month.abb)
month_max <- month.name[which.max(plot_mth$avg_bikers)]
month_min <- month.name[which.min(plot_mth$avg_bikers)]

user_max <- max(plot_mth$avg_bikers)
user_min <- min(plot_mth$avg_bikers)

changes <- rep(0, 11)
diff_mth <- rep("x", 11)

for (i in 2:12){
  diff_mth[i - 1] <- paste(month.name[i - 1], "to", month.name[i])
  changes[i - 1] <- round(plot_mth$avg_bikers[i] - plot_mth$avg_bikers[i - 1])
}

diff_tab_mth <- as_tibble(cbind(diff_mth, changes))
According to the graph, August has the highest average daily number of users at 5930. The month with the lowest number of users is January, with an average of 2568.
The largest decrease in month to month users was October to November with an average change of -1024.
The largest increase in month to month users was March to April with an average change of 982.
We would like to see what effect public holidays have on each type of bicycle user on average for a given day. In this case, the Tuesday data shows the following relationships:
bikes %>% ggplot(aes(x = as.factor(workingday), y = casual)) + geom_boxplot() +
  labs(title = paste("Casual Users on", params$day_of_week)) +
  xlab("") +
  ylab("Casual Users") +
  scale_x_discrete(labels = c('Public Holiday', 'Workday')) +
  theme_minimal()

bikes %>% ggplot(aes(x = as.factor(workingday), y = registered)) + geom_boxplot() +
  labs(title = paste("Registered Users on", params$day_of_week)) +
  xlab("") +
  ylab("Registered Users") +
  scale_x_discrete(labels = c('Public Holiday', 'Workday')) +
  theme_minimal()
Temperature and humidity have an effect on the number of users on a given day.
First, normalized temperature data (both actual temperature and perceived):
bike_temp <- bikes %>% select(cnt, temp, atemp) %>%
  gather(key = type, value = temp_norm, temp, atemp, factor_key = FALSE)
ggplot(bike_temp, aes(x = temp_norm, y = cnt, col = type, shape = type)) +
geom_point() + geom_smooth(formula = 'y ~ x', method = 'loess') +
scale_color_discrete(name = "Temp Type", labels = c("Perceived", "Actual")) +
scale_shape_discrete(name = "Temp Type", labels = c("Perceived", "Actual")) +
labs(title = paste("Temperature on", params$day_of_week, "(Actual and Perceived)")) +
xlab("Normalized Temperatures") +
ylab("Total Users") +
theme_minimal()
Next the effect of humidity:
bikes %>% ggplot(aes(x = hum, y = cnt)) + geom_point() + geom_smooth(formula = 'y ~ x', method = 'loess') +
  labs(title = paste("Humidity versus Total Users on", params$day_of_week)) +
xlab("Humidity (normalized)") +
ylab("Total Number of Users") +
theme_minimal()
Here we are checking the correlation between the numeric predictors in the data.
knitr::kable(round(cor(bikes[ , c(11:16)]), 3))
 | atemp | hum | windspeed | casual | registered | cnt
---|---|---|---|---|---|---
atemp | 1.000 | 0.117 | -0.122 | 0.789 | 0.585 | 0.644 |
hum | 0.117 | 1.000 | -0.101 | -0.184 | -0.171 | -0.180 |
windspeed | -0.122 | -0.101 | 1.000 | -0.207 | -0.132 | -0.151 |
casual | 0.789 | -0.184 | -0.207 | 1.000 | 0.771 | 0.842 |
registered | 0.585 | -0.171 | -0.132 | 0.771 | 1.000 | 0.993 |
cnt | 0.644 | -0.180 | -0.151 | 0.842 | 0.993 | 1.000 |
corrplot(cor(bikes[ , c(11:16)]), method = "circle")
Now, we will fit two linear regression models, a random forest model, and a boosting model. We will use cross-validation to select the best tuning parameters for the ensemble-based methods, and then compare all four models using the test MSE.
Linear regression is one of the most common methods for modeling. It looks at a set of predictors and estimates how the response changes when one predictor, or a combination of predictors, changes. This model is highly interpretable, as it shows us the effect of each individual predictor as well as interactions; we can see whether the response goes up or down and by how much. The model is fit by minimizing the sum of squared differences between the fitted values and the actual values in the training set. Below we fit two different linear regression models.
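As a small illustration of least squares (a hypothetical two-predictor fit on the training data, not one of the models used below), lm() returns the coefficients that minimize the residual sum of squares:

```r
# Illustration only: an ordinary least squares fit on two predictors.
# The coefficients are the values that minimize sum((cnt - fitted)^2)
# over the training data.
toy_fit <- lm(cnt ~ temp + hum, data = train)
coef(toy_fit)          # estimated change in cnt per unit change in each predictor
sum(resid(toy_fit)^2)  # the minimized residual sum of squares
```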
The first model will use a subset of predictors chosen by stepwise selection. Once we have chosen an interesting set of predictors, we will use cross-validation to determine the RMSE and R².
lm_fit_select <- lm(cnt ~ ., data = train[ , c(1:3, 6:11)])
model <- step(lm_fit_select)
## Start: AIC=975.89
## cnt ~ season + yr + mnth + weathersit + temp + atemp + hum +
## windspeed
##
## Df Sum of Sq RSS AIC
## - mnth 11 7577159 37663780 970.06
## - weathersit 2 1593831 31680452 975.61
## <none> 30086621 975.89
## - windspeed 1 1434757 31521378 977.24
## - hum 1 2629246 32715867 979.92
## - temp 1 2961179 33047800 980.65
## - atemp 1 5174461 35261082 985.32
## - season 3 9799080 39885701 990.19
## - yr 1 54900983 84987604 1048.66
##
## Step: AIC=970.06
## cnt ~ season + yr + weathersit + temp + atemp + hum + windspeed
##
## Df Sum of Sq RSS AIC
## - weathersit 2 1683110 39346890 969.21
## - windspeed 1 647040 38310820 969.29
## <none> 37663780 970.06
## - temp 1 3047156 40710936 973.66
## - hum 1 4132497 41796276 975.56
## - atemp 1 5490735 43154515 977.86
## - season 3 15438171 53101951 988.80
## - yr 1 55424371 93088151 1033.21
##
## Step: AIC=969.21
## cnt ~ season + yr + temp + atemp + hum + windspeed
##
## Df Sum of Sq RSS AIC
## - windspeed 1 744671 40091561 968.56
## <none> 39346890 969.21
## - temp 1 3255073 42601963 972.93
## - atemp 1 6032726 45379616 977.48
## - season 3 15875797 55222687 987.62
## - hum 1 14671440 54018330 990.03
## - yr 1 54671371 94018261 1029.93
##
## Step: AIC=968.56
## cnt ~ season + yr + temp + atemp + hum
##
## Df Sum of Sq RSS AIC
## <none> 40091561 968.56
## - temp 1 4260597 44352158 973.83
## - atemp 1 7815336 47906897 979.38
## - season 3 17112995 57204556 988.15
## - hum 1 14833016 54924576 989.23
## - yr 1 54174919 94266479 1028.12
variables <- names(model$model)
#variables we will use for our model
variables
## [1] "cnt" "season" "yr" "temp" "atemp" "hum"
set.seed(10)
lm.fit <- train(cnt ~ ., data = train[variables], method = 'lm',
                preProcess = c('center', 'scale'),
                trControl = trainControl(method = 'cv', number = 10))
lm.fit
## Linear Regression
##
## 72 samples
## 5 predictor
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 64, 64, 65, 64, 65, 66, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 803.4752 0.8484423 653.3827
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Our first linear model has an RMSE of 803.48.
Next, we add interactions between the terms included in the first model.
set.seed(10)
lm.fit1 <- train(cnt ~ . + .*., data = train[variables], method = 'lm',
                 preProcess = c('center', 'scale'),
                 trControl = trainControl(method = 'cv', number = 10))
lm.fit1
## Linear Regression
##
## 72 samples
## 5 predictor
##
## Pre-processing: centered (25), scaled (25)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 64, 64, 65, 64, 65, 66, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 946.7255 0.7783881 705.762
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
The RMSE value of the model increased to 946.73, so adding interactions did not improve the cross-validated fit.
Ensemble tree methods come in many types and are very versatile for both regression and classification. Here, we use the two most common and well-known methods: random forests (a form of bagging) and boosting. Both of these tree-based methods involve tuning during model development. For random forests, the tuning involves varying the number of predictors considered at each split, which keeps one or two strong predictors from overshadowing the others. Boosting builds its final model through an iterative combination of weaker models, where each iteration builds upon the last. While both methods are very flexible and tend to produce good results, the models themselves are not as interpretable as linear regression, so we normally just analyze their output.
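To make the boosting idea concrete, here is a minimal conceptual sketch (the boost_sketch function, its defaults, and the use of the rpart package are illustrative assumptions; the actual model later in this report is fit with caret's gbm method). Each small tree is fit to the current residuals, and a shrunken version of its predictions is added to the ensemble:

```r
# Conceptual sketch of boosting for regression (illustration only).
library(rpart)

boost_sketch <- function(df, n_trees = 100, shrinkage = 0.1, depth = 1){
  pred <- rep(0, nrow(df))   # current ensemble prediction
  res  <- df$cnt             # residuals start as the raw response
  for (b in seq_len(n_trees)){
    # fit a small tree to the current residuals
    fit <- rpart(res ~ . - cnt, data = cbind(df, res = res),
                 control = rpart.control(maxdepth = depth))
    # add a shrunken version of its predictions, then update the residuals
    pred <- pred + shrinkage * predict(fit, newdata = df)
    res  <- df$cnt - pred
  }
  pred
}

# Hypothetical usage: boosted fitted values on the training set
# head(boost_sketch(train))
```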
Below is the result of training with the random forest method. Each tree is fit to a bootstrap sample of the data and considers only a random subset of the predictors, and the results are averaged across many trees. By reducing the number of predictors considered in each tree, we may be able to reduce the correlation between trees and improve our results. In the training model below, we vary the number of predictors used in each tree.
rf_fit <- train(cnt ~ ., data = train, method = 'rf',
                preProcess = c('center', 'scale'),
                tuneGrid = data.frame(mtry = 1:10))
rf_fit
## Random Forest
##
## 72 samples
## 10 predictors
##
## Pre-processing: centered (23), scaled (23)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 72, 72, 72, 72, 72, 72, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 1409.7217 0.5601937 1147.5424
## 2 1172.9163 0.6287130 970.1518
## 3 1066.9271 0.6735929 871.2112
## 4 1017.3531 0.6914595 814.1103
## 5 980.7443 0.7072339 769.9421
## 6 956.6632 0.7180092 739.0092
## 7 942.4831 0.7230813 717.4531
## 8 929.6959 0.7275607 698.7745
## 9 928.0898 0.7249071 692.5225
## 10 923.5750 0.7257036 683.1027
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 10.
The best model uses mtry = 10, the number of predictors considered at each split. This gives an RMSE of 923.57.
The following are the results of the boosting model development using the provided bike data.
trctrl <- trainControl(method = "repeatedcv",
                       number = 10,
                       repeats = 3)

set.seed(2020)

boost_grid <- expand.grid(n.trees = c(20, 100, 500),
                          interaction.depth = c(1, 3, 5),
                          shrinkage = c(0.1, 0.01, 0.001),
                          n.minobsinnode = 10)

boost_fit <- train(cnt ~ .,
                   data = train,
                   method = "gbm",
                   verbose = F, #suppresses excessive printing while model is training
                   trControl = trctrl,
                   tuneGrid = data.frame(boost_grid))
A total of 27 models were evaluated, each differing by its combination of boosting parameters. The results are shown below:
print(boost_fit)
## Stochastic Gradient Boosting
##
## 72 samples
## 10 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 66, 64, 64, 64, 64, 64, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.001 1 20 1733.4778 0.5997355 1387.4202
## 0.001 1 100 1686.9318 0.6177627 1351.7161
## 0.001 1 500 1497.5223 0.6852299 1197.8080
## 0.001 3 20 1730.6068 0.7106399 1384.6644
## 0.001 3 100 1671.5146 0.7193269 1336.1699
## 0.001 3 500 1438.7752 0.7459447 1138.4866
## 0.001 5 20 1730.3416 0.7140856 1384.2059
## 0.001 5 100 1671.5330 0.7246857 1336.0054
## 0.001 5 500 1438.9438 0.7395187 1139.3725
## 0.010 1 20 1635.2834 0.5996410 1311.2101
## 0.010 1 100 1332.0244 0.7108082 1065.5038
## 0.010 1 500 907.1474 0.7940720 735.1953
## 0.010 3 20 1602.3316 0.7281379 1277.1202
## 0.010 3 100 1254.9025 0.7509281 987.3399
## 0.010 3 500 890.9477 0.7984919 727.2590
## 0.010 5 20 1608.1380 0.7249979 1283.2532
## 0.010 5 100 1246.1601 0.7571994 981.7668
## 0.010 5 500 888.3036 0.7976794 728.4051
## 0.100 1 20 1130.0346 0.7421134 916.4738
## 0.100 1 100 827.6603 0.8131856 678.1385
## 0.100 1 500 856.7059 0.7987222 681.1798
## 0.100 3 20 1070.3678 0.7452824 869.5596
## 0.100 3 100 859.8338 0.7962112 697.7899
## 0.100 3 500 887.8779 0.7885376 706.6418
## 0.100 5 20 1049.3031 0.7636127 857.1309
## 0.100 5 100 856.2400 0.8026035 691.9272
## 0.100 5 500 881.5841 0.7945837 701.1915
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 100, interaction.depth = 1, shrinkage
## = 0.1 and n.minobsinnode = 10.
results_tab <- as_tibble(boost_fit$results[,c(1,2,4:6)])
The attributes of the best model are shown here.
boost_min <- which.min(results_tab$RMSE)

knitr::kable(results_tab[boost_min,], digits = 2)
shrinkage | interaction.depth | n.trees | RMSE | Rsquared |
---|---|---|---|---|
0.1 | 1 | 100 | 827.66 | 0.81 |
Here we compare the four models developed earlier. Each model was applied to the test set, and the predictions were then used to calculate the test MSE. Below are the results.
lm_pred <- predict(lm.fit, newdata = test)
lm_pred1 <- predict(lm.fit1, newdata = test)
rf_pred <- predict(rf_fit, newdata = test)
boost_pred <- predict(boost_fit, newdata = test)

prediction_values <- as_tibble(cbind(lm_pred, lm_pred1, rf_pred, boost_pred))

lm_MSE <- mean((lm_pred - test$cnt)^2)
lm_MSE1 <- mean((lm_pred1 - test$cnt)^2)
rf_MSE <- mean((rf_pred - test$cnt)^2)
boost_MSE <- mean((boost_pred - test$cnt)^2)

comp <- data.frame('Linear Model 1' = lm_MSE,
                   'Linear Model 2' = lm_MSE1,
                   'Random Forest Model' = rf_MSE,
                   'Boosting Model' = boost_MSE)

knitr::kable(t(comp), col.names = "MSE")
Model | MSE
---|---
Linear.Model.1 | 501276.3 |
Linear.Model.2 | 428919.9 |
Random.Forest.Model | 825363.6 |
Boosting.Model | 657360.2 |
It was found that Linear.Model.2 achieves the lowest test MSE, approximately 428,920, for Tuesday data.
Below is a graph of the Actual vs Predicted results:
index_val <- which.min(t(comp))

results_plot <- as_tibble(cbind("preds" = prediction_values[[index_val]], "actual" = test$cnt))
ggplot(data = results_plot, aes(preds, actual)) + geom_point() +
labs(x = paste(names(which.min(comp)), "Predictions"), y = "Actual Values",
title = paste(names(which.min(comp)), "Actual vs Predicted Values")) +
geom_abline(slope = 1, intercept = 0, col = 'red')