People regularly quantify how much of a particular activity they do with wearable devices, but they rarely quantify how well they do it. This project seeks to determine the potential for real-time user feedback from model-based assessment of the quality, not quantity, of exercise technique.
To this end, six participants were asked to perform dumbbell lifts correctly and incorrectly in five different ways while accelerometers on the belt, forearm, arm, and dumbbell recorded their movements.
More information is available on the Groupware@LES website (see the section on the Weight Lifting Exercise Dataset).
The data for this project were kindly provided by Groupware@LES and can be downloaded from their website (training data (csv, 12 MB), test data (csv, 15 KB)).
A considerable amount of cleaning was needed. It is documented next; however, one can skip to the section on model building if not interested.
library(ggplot2)
library(caret)
library(plyr)
set.seed(42)
# Read in the data, suppressing the conversion of strings to factors and
# treating "#DIV/0!" and empty strings as missing values.
training <- read.csv("pml-training.csv",
                     stringsAsFactors = F,
                     na.strings = c("#DIV/0!", "", "NA"))
testing <- read.csv("pml-testing.csv",
                    stringsAsFactors = F,
                    na.strings = c("#DIV/0!", "", "NA"))
clean_data <- function(d){
  # The data contain several unnecessary, non-sensor variables/features that
  # need to be dropped. One problematic variable, "new_window", has levels
  # "yes"/"no" in the training data but only "no" in the testing data.
  # "new_window", therefore, is likely a flag the original researchers included
  # to mark a subset of rows for a purpose that is unknown (although these rows
  # likely hold summary statistics). Regardless, since "new_window" is a valid
  # feature of only one dataset, all rows where new_window == "yes" are removed
  # and the feature is dropped.
  d <- d[d$new_window == "no",]
  drop_vars <- c("X",
                 "raw_timestamp_part_1",
                 "raw_timestamp_part_2",
                 "cvtd_timestamp",
                 "new_window",
                 "num_window")
  keep_vars <- !names(d) %in% drop_vars
  d <- d[, keep_vars]
  # Properly code factors:
  all_possible_factor_vars <- c("user_name", "classe", "problem_id")
  factor_vars <- names(d)[names(d) %in% all_possible_factor_vars]
  for(i in factor_vars){
    d[,i] <- as.factor(d[,i])
  }
  # Fix character fields which should be numeric:
  chr2num_vars <- names(d)[unlist(lapply(d, function(x) is.character(x)))]
  for(i in chr2num_vars){
    d[,i] <- as.numeric(d[,i])
  }
  # Fix logical fields which should be numeric:
  logical2num_vars <- names(d)[unlist(lapply(d, function(x) is.logical(x)))]
  for(i in logical2num_vars){
    d[,i] <- as.numeric(d[,i])
  }
  d
}
train_clean <- clean_data(training)
test_clean <- clean_data(testing)
# Some features/variables have no data in the test set. Furthermore, their names indicate that they are summary statistics (max_, min_, avg_, var_, etc.). Thus, the variables in the training set were limited to those present in the test set.
no_test_data_var_names <- names(test_clean)[
  unlist(lapply(test_clean, function(x) sum(is.na(x)))) == nrow(test_clean)
]
valid_test_names <- names(test_clean)[
  !names(test_clean) %in% no_test_data_var_names
]
train_valid <- train_clean[, names(train_clean) %in% c(valid_test_names, "classe")]
test_valid <- test_clean[, names(test_clean) %in% valid_test_names]
# The provided "testing" dataset does not acutally have the variable we are trying to predict ("classe") and so it is impossible to use it to test our model. Thus training, validation and testing datasets were created from the provided training dataset and reserve the provided test dataset for the final prediction exercise.
test_final <- test_valid
A complication in creating the data partitions is that there is clear evidence of dependence between the predictor variables and the test subjects, and the number of observations per subject is unequal. For example, observe how half the subjects have "roll_belt" values consistently near zero while the other half cluster around 125-150.
ggplot(train_valid,
       aes(x = seq_along(row.names(train_valid)),
           y = roll_belt,
           col = user_name)) +
  geom_jitter(shape = 1)
Thus, to avoid unintended bias, data partitions were created per individual and then recombined. This guarantees that each subject is represented proportionally across the partitions. Granted, a simple random reshuffling of the dataset prior to partitioning would likely produce similar results.
list_of_data_partitions <- dlply(train_valid, "user_name", function(x){
  inBuild <- createDataPartition(x[,"classe"],
                                 p = 0.7)[[1]]
  build_df <- x[inBuild,]
  validation_df <- x[-inBuild,]
  inTrain <- createDataPartition(build_df[,"classe"],
                                 p = 0.7)[[1]]
  train_df <- build_df[inTrain,]
  test_df <- build_df[-inTrain,]
  list(train_df = train_df,
       validation_df = validation_df,
       test_df = test_df)
})
train_df <- ldply(list_of_data_partitions, function(x) x[["train_df"]])
validation_df <- ldply(list_of_data_partitions, function(x) x[["validation_df"]])
test_df <- ldply(list_of_data_partitions, function(x) x[["test_df"]])
# Confirm it all went to plan (note: mean relative differences approached 5% when partitioning the data irrespective of subject):
c(all.equal(table(train_df$user_name)/nrow(train_df),
            table(test_df$user_name)/nrow(test_df),
            tolerance = 0.01),
  all.equal(table(train_df$user_name)/nrow(train_df),
            table(validation_df$user_name)/nrow(validation_df),
            tolerance = 0.01))
## [1] TRUE TRUE
# Remove user_name now that it is no longer needed
remove_user <- function(x, var){
  x[, !names(x) %in% var]
}
train_df <- remove_user(train_df,"user_name")
test_df <- remove_user(test_df,"user_name")
validation_df <- remove_user(validation_df,"user_name")
test_final <- remove_user(test_final,"user_name")
# Clean up environment
rm(list = ls()[!ls() %in% c("test_final", "train_df", "test_df", "validation_df")])
Different classes of models were considered as candidates to predict the classification of dumbbell lifts. Bagged (random forest, rf) and boosted (gradient boosting machine, gbm) models were assumed to be the best approaches for a classification problem with no need for model interpretability. These models also perform well, relative to regression techniques, on the kind of data provided here: a large number of predictors likely to exhibit collinearity, interactions and threshold effects. However, a regression method (generalized linear model via penalized maximum likelihood, glmnet) and a model-based prediction method (linear discriminant analysis, lda) were included for comparison; they have the advantage of usually being faster to fit (although speed was not an important factor in choosing a model).
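As a quick, hypothetical check of the collinearity claim (this exploratory step is a sketch and was not part of the original analysis), caret's findCorrelation() can flag predictors with high pairwise correlation:
# Hypothetical exploratory check, not part of the original analysis:
# flag predictors with absolute pairwise correlation above 0.9.
pred_cor <- cor(train_df[, names(train_df) != "classe"])
high_cor <- findCorrelation(pred_cor, cutoff = 0.9)
length(high_cor) # number of columns findCorrelation would suggest dropping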
Repeated cross-validation (10 folds, repeated 5 times) was chosen to estimate out-of-sample error from the training data, as it gives a good balance of bias and variance. A longer tuning grid (tuneLength = 10) was used so that more values of mtry (the number of predictors considered at each split) could be assessed for the rf model. This increased runtime significantly, but testing showed that it did produce lower estimates of out-of-sample error. Pre-processing the data (centring, scaling and PCA) was considered, but it degraded performance for all but the regression-style models and so was applied only to glmnet.
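The pre-processing comparison itself was not retained. A minimal sketch of how it could be reproduced for a single model is shown below; ctrl_quick, lda_raw and lda_pca are illustrative names, and the lightweight 10-fold CV control is an assumption rather than the control used for the final models.
# Hedged sketch only; the original comparison code is not shown.
ctrl_quick <- trainControl(method = "cv", number = 10)
lda_raw <- train(classe ~ ., data = train_df, method = "lda",
                 trControl = ctrl_quick)
lda_pca <- train(classe ~ ., data = train_df, method = "lda",
                 preProcess = c("center", "scale", "pca"),
                 trControl = ctrl_quick)
# Compare resampled accuracy with and without pre-processing:
c(raw = max(lda_raw$results$Accuracy), pca = max(lda_pca$results$Accuracy))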
# Set up for all models ----
set.seed(321)
# nearZeroVar(train_df) # no variables with near-zero variance; continue.
# Use the same training controls and CV folds for all models:
myFolds <- createFolds(train_df$classe,
                       k = 10,
                       list = T,
                       returnTrain = T)
trControl <- trainControl(method = "repeatedcv",
                          classProbs = T,
                          repeats = 5,
                          index = myFolds,
                          verboseIter = F,
                          savePredictions = T)
# Run all models ----
# Using caretEnsemble to unify options, caretList() creates a list of train() objects.
library(caretEnsemble)
library(ranger) # alternative to base rf model
model_list <- caretList(classe ~ .,
                        data = train_df,
                        methodList = c(lda = "lda",
                                       gbm = "gbm",
                                       rf = "ranger"),
                        tuneList = list(glmnet = caretModelSpec("glmnet",
                                                                preProcess = c("center",
                                                                               "scale"))),
                        trControl = trControl,
                        tuneLength = 10)
# Compare model performance with a plot
resamps <- resamples(model_list)
dotplot(resamps)
Based on the dotplot, the rf and gbm models both perform well, with a slight advantage to rf; its mean resampled accuracy of about 99.2% corresponds to an estimated out-of-sample error rate of roughly 1%.
summary(resamps)
##
## Call:
## summary.resamples(object = resamps)
##
## Models: glmnet, lda, gbm, rf
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glmnet 0.7158006 0.7278263 0.7376807 0.7402509 0.7523119 0.7709438 0
## lda 0.6829268 0.7003173 0.7049779 0.7062474 0.7118644 0.7359491 0
## gbm 0.9830508 0.9878377 0.9941709 0.9920564 0.9957582 0.9978836 0
## rf 0.9872881 0.9907309 0.9925769 0.9923732 0.9936373 0.9978836 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glmnet 0.6385857 0.6548940 0.6671087 0.6705442 0.6861302 0.7101125 0
## lda 0.5986320 0.6199868 0.6271502 0.6282016 0.6348497 0.6666440 0
## gbm 0.9785446 0.9846247 0.9926255 0.9899516 0.9946344 0.9973230 0
## rf 0.9839134 0.9882728 0.9906092 0.9903522 0.9919517 0.9973226 0
model_list$rf$finalModel
## Ranger result
##
## Call:
## ranger::ranger(dependent.variable.name = ".outcome", data = x, mtry = param$mtry, min.node.size = param$min.node.size, splitrule = as.character(param$splitrule), write.forest = TRUE, probability = classProbs, ...)
##
## Type: Probability estimation
## Number of trees: 500
## Sample size: 9440
## Number of independent variables: 52
## Mtry: 29
## Target node size: 1
## Variable importance mode: none
## Splitrule: extratrees
## OOB prediction error (Brier s.): 0.02843636
A comparison of predictions from both models on the validation set (not shown) confirmed that the gbm model was also a strong candidate.
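A minimal sketch of that omitted comparison might look like the following; rf_val and gbm_val are illustrative names, not objects from the original analysis.
# Sketch of the omitted validation-set comparison: accuracy of rf vs gbm
# on the held-out validation data.
rf_val  <- confusionMatrix(predict(model_list$rf,  validation_df), validation_df$classe)
gbm_val <- confusionMatrix(predict(model_list$gbm, validation_df), validation_df$classe)
c(rf = rf_val$overall["Accuracy"], gbm = gbm_val$overall["Accuracy"])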
Finally, the performance of the rf model was assessed on the test set, where it was confirmed to perform well.
confusionMatrix(predict(model_list$rf,test_df),test_df$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1144 0 0 0 0
## B 2 776 3 0 1
## C 0 2 698 9 0
## D 0 0 1 650 6
## E 0 0 0 0 732
##
## Overall Statistics
##
## Accuracy : 0.994
## 95% CI : (0.9911, 0.9962)
## No Information Rate : 0.2848
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9925
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9983 0.9974 0.9943 0.9863 0.9905
## Specificity 1.0000 0.9982 0.9967 0.9979 1.0000
## Pos Pred Value 1.0000 0.9923 0.9845 0.9893 1.0000
## Neg Pred Value 0.9993 0.9994 0.9988 0.9973 0.9979
## Prevalence 0.2848 0.1933 0.1745 0.1638 0.1836
## Detection Rate 0.2843 0.1928 0.1735 0.1615 0.1819
## Detection Prevalence 0.2843 0.1943 0.1762 0.1633 0.1819
## Balanced Accuracy 0.9991 0.9978 0.9955 0.9921 0.9953
Our random forest (rf) model predicted the dumbbell lift classification with just over 99% accuracy. Using wearable computing devices to assess the quality, not just the quantity, of exercise therefore appears to have merit.
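For completeness, the reserved test cases (test_final) can then be scored with the chosen rf model; final_predictions is an illustrative name, and the predicted class labels themselves are not reproduced here.
# Predict the reserved cases in the provided testing data with the final rf model.
final_predictions <- predict(model_list$rf, test_final)
final_predictions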