For this project, I focused on the Philippines, specifically the provinces of Davao del Norte, Davao del Sur, Davao Oriental, and Compostela Valley. The linear regression, random forest, and support vector machine models that I ran all performed reasonably well.
The linear regression and random forest models produced very similar ME, MAE, and RMSE plots. The RMSE plots for both models show that error was low across most of the study area but very high in a few particular localities. The location with the highest root mean squared error, seen most clearly in the 3D plots, appears to be the area near the city of Mati, although it is hard to tell exactly.
The support vector machine model, which I added as a stretch goal, does not account for spatial variation. I did add the step of scaling the data with a simple min-max scaling function, so its RMSE of 0.0791883 is expressed on the scaled 0 to 1 range. That is why it looks much lower, and it is not directly comparable to the errors of the other two models. The R-squared for this model is about 0.5, which is not that great but higher than I expected for such a simple model.
The variable importance plot for the random forest model shows that the NTL and DST190 variables were the most important predictors. I think this is a logical result: night-time lights should correlate highly with population, and urban cover (DST190) should likewise mark centers of higher population. It is a good sign that the model gives the most weight to the variables one would expect to matter.
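As a rough illustration of how the three error measures compared above differ, here is a minimal sketch in base R. The observed and predicted vectors are made-up values, not data from this project, the helper name error_metrics is hypothetical, and the sign convention (predicted minus observed) is an assumption.

```r
# Made-up observed and predicted values (not project data)
observed  <- c(100, 250, 400, 800, 5000)
predicted <- c(120, 230, 390, 900, 3500)

# error_metrics() is an illustrative helper, not part of any package.
# Assumed sign convention: error = predicted - observed.
error_metrics <- function(obs, pred) {
  err <- pred - obs
  c(
    ME   = mean(err),          # mean error: signed, so over- and under-estimates cancel
    MAE  = mean(abs(err)),     # mean absolute error: average size of the miss
    RMSE = sqrt(mean(err^2))   # root mean squared error: penalizes large misses most
  )
}

m <- error_metrics(observed, predicted)
print(m)
```

In this toy example a single badly predicted value pulls the RMSE well above the MAE, which is the same pattern as a few high-error localities standing out in an RMSE plot.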
Population sums plot:
Diff sums plot:
Mean Error:
Mean Absolute Error:
Root Mean Squared Error (3D):
Cell stats (diff sums): 6016931
RF Population sums plot:
RF Diff sums plot:
RF Variable importance plot:
RF Mean Error:
RF Mean Absolute Error:
RF Root Mean Squared Error (3D):
Cell stats (diff sums): 5721193
Source: https://www.kaggle.com/thapelomola/svm-with-caret
For this model, I examined the effect of the predictor variables on sum.pop15. This is a different approach from the linear regression and random forest models above because it does not incorporate the geospatial element. The cross-validated R-squared for this model is 0.5085856, which is not terrible but not very good either. The RMSE is 0.0791883 and the MAE is 0.03668218, both measured on the min-max scaled response.
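library(caret)
library(rsample)  # provides initial_split(), training(), testing()
# `data` is assumed to be the data frame of predictors plus sum.pop15 built earlier
set.seed(512)  # seed the random split so it is reproducible
data_split <- initial_split(data, prop = 4/5)
data_train <- training(data_split)
data_test <- testing(data_split)
# 5-fold cross-validation, repeated 5 times
TrainCtrl1 <- trainControl(method = "repeatedcv", number = 5, repeats = 5, verboseIter = FALSE)
set.seed(512)
# Single fixed hyperparameter combination (no tuning)
SVMgrid <- expand.grid(sigma = c(0.025), C = c(2))
# Column 10 is assumed to be the response, sum.pop15
X <- data_train[, -10]
# Simple min-max scaling; note min(X) and max(X) are taken over all predictor columns at once
X_scaled <- (X - min(X)) / (max(X) - min(X))
Y <- data_train$sum.pop15
Y_scaled <- (Y - min(Y)) / (max(Y) - min(Y))
# Scale the test set with the training minimum and maximum so both sets share one transform
x_test <- data_test[, -10]
x_test_scaled <- (x_test - min(X)) / (max(X) - min(X))
y_test <- data_test$sum.pop15
y_test_scaled <- (y_test - min(Y)) / (max(Y) - min(Y))
modelSvmRRB <- train(X_scaled, Y_scaled, method = "svmRadial", trControl = TrainCtrl1, tuneGrid = SVMgrid, preProc = c("scale", "YeoJohnson"), verbose = FALSE)
PredictedTest <- predict(modelSvmRRB, x_test_scaled)
# Accuracy() is a classification metric; for regression, report RMSE, R-squared, and MAE
postResample(pred = PredictedTest, obs = y_test_scaled)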
Output:
> modelSvmRRB
Support Vector Machines with Radial Basis Function Kernel
741 samples
12 predictor
Pre-processing: scaled (12), Yeo-Johnson transformation (12)
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 593, 593, 593, 592, 593, 593, ...
Resampling results:
  RMSE        Rsquared   MAE
  0.07918833  0.5085856  0.03668218
Tuning parameter 'sigma' was held constant at a value of 0.025
Tuning parameter 'C' was held constant at a value of 2
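Because the response was min-max scaled before training, the RMSE and MAE printed above are in scaled units rather than population counts. The following is a minimal sketch of how a scaled error could be converted back, assuming the minimum and maximum of the training response are kept. The vectors are made-up values, not project data, and the helper names minmax_scale, minmax_unscale, and rmse are hypothetical.

```r
# Made-up response values standing in for sum.pop15 (not project data)
y_train <- c(50, 200, 1200, 8000, 20050)
y_obs   <- c(300, 5000, 150)
y_pred  <- c(450, 4200, 100)

# Illustrative helpers; lo and hi come from the training response only
minmax_scale   <- function(v, lo, hi) (v - lo) / (hi - lo)
minmax_unscale <- function(s, lo, hi) s * (hi - lo) + lo
rmse           <- function(obs, pred) sqrt(mean((pred - obs)^2))

lo <- min(y_train)
hi <- max(y_train)

# RMSE computed on the scaled values, as in the caret output
rmse_scaled <- rmse(minmax_scale(y_obs, lo, hi), minmax_scale(y_pred, lo, hi))

# Multiplying by the training range recovers the RMSE in original units
rmse_original <- rmse_scaled * (hi - lo)

print(c(scaled = rmse_scaled, original = rmse_original))
```

The same multiplication by (hi - lo) applies to the MAE. R-squared is unaffected by the scaling, since it does not depend on the units of the response.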
SVM Model Mean Error Plot:
SVM Mean Absolute Error Plot:
SVM Root Mean Squared Error Plot (3D):