In this blog, I will introduce an R package for fully automated heterogeneous ensemble learning (classification, regression). It significantly lowers the barrier for practitioners, even non-experts, to apply heterogeneous ensemble learning techniques to their everyday predictive problems.
Before we delve into the package details, let's start by understanding a few basic concepts.
Why Ensemble Learning?
Generally, predictions become unreliable when the input sample lies outside the training distribution, when the model is biased toward the training data distribution, or when the data are noisy. Most remedies require changes to the network architecture, fine-tuning, balanced data, larger models, and so on. The choice of algorithm also plays a vital role, and scalability and learning ability tend to degrade on complex datasets. Combining multiple learners is an effective alternative that has been applied to many real-world problems. Ensemble learners combine a diverse collection of predictions from individual base models into a composite predictive model that is more accurate and robust than its components. With meta ensemble learning, one can reduce generalization error to some extent irrespective of the data distribution, number of classes, choice of algorithm, number of models, or dataset complexity. In short, the resulting predictive models generalize better.
How can we build models in a more stable fashion while minimizing under-fitting and overfitting, both of which are critical to the overall outcome? The solution is ensemble meta-learning over a heterogeneous collection of base learners.
Common Ensemble Learning Techniques
The popular ensemble techniques are summarized in the figure below. Stacked generalization is a general method of using a high-level model to combine lower-level models to achieve greater predictive accuracy. In bagging, independent base models are fitted to bootstrap samples of the original dataset. Boosting grows an ensemble iteratively, in a dependent fashion, adjusting the weight of each observation based on past predictions. There are several extensions of bagging and boosting.
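To make stacking concrete, here is a minimal sketch of the idea in R, assuming the caret package (whose method names, e.g. 'rpart', 'treebag', 'gbm', also appear in the metaEnsembleR calls below): two heterogeneous base learners are trained, and a high-level learner is then fit on their predictions. A full stacked generalization would use cross-validated out-of-fold predictions rather than a single holdout, so treat this purely as an illustration, not the package's internals.

library(caret)
set.seed(111)
data(iris)
idx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_df <- iris[idx, ]
holdout_df <- iris[-idx, ]
# Two heterogeneous base learners
base1 <- train(Species ~ ., data = train_df, method = "rpart")
base2 <- train(Species ~ ., data = train_df, method = "treebag")
# Meta-features: base-learner predictions on held-out data
meta_df <- data.frame(p1 = predict(base1, holdout_df),
                      p2 = predict(base2, holdout_df),
                      Species = holdout_df$Species)
# The high-level (final) learner combines the base predictions
stack_model <- train(Species ~ ., data = meta_df, method = "gbm", verbose = FALSE)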
Overview
metaEnsembleR is an R package for automated meta-learning (classification, regression). The functionalities provided include simple, user-input-driven predictive modeling with a choice of algorithms, train-validation-test splitting, model evaluation, and guided prediction on unseen data, which helps users build stacked ensembles on the go. The core aim of this package is to cater to a broad audience. metaEnsembleR significantly lowers the barrier for practitioners, even non-experts, to apply heterogeneous ensemble learning techniques to their everyday predictive problems.
Using metaEnsembleR
The package consists of the following components:
- Ensemble Classifiers Training and Prediction
- Ensemble Regressor Training and Prediction
- Model evaluation, model results (observation vs. prediction on test data), prediction on new unseen data, and writing performance charts and prediction results to disk
All these functions are intuitive; their use is illustrated below with examples covering classification and regression.
Getting Started
The package can be installed directly from CRAN.
Install from the R console: install.packages("metaEnsembleR")
However, the latest stable version (if any) can be found on GitHub and installed using the devtools package.
Install from GitHub:
if(!require(devtools)) install.packages("devtools")
devtools::install_github(repo = 'ajayarunachalam/metaEnsembleR', ref = 'main')
Usage
library(“metaEnsembleR”)
set.seed(111)
- Training the ensemble classification model is a one-line call to the ensembler.classifier function. You can pass either a CSV file read directly or an imported data frame; the arguments are, in order: the dataset, the outcome/response variable index, the base learners, the final learner, the train-validation-test split ratios, and the unseen data.
ensembler_return <- ensembler.classifier(iris[1:130,], 5, c('treebag','rpart'), 'gbm', 0.60, 0.20, 0.20, read.csv('./unseen_data.csv'))
OR
unseen_new_data_testing <- iris[130:150,]
ensembler_return <- ensembler.classifier(iris[1:130,], 5, c('treebag','rpart'), 'gbm', 0.60, 0.20, 0.20, unseen_new_data_testing)
The above function returns the test data with its predictions, the prediction labels, the model results, and finally the predictions on the unseen data.
testpreddata <- data.frame(ensembler_return[1])
table(testpreddata$actual_label)
table(ensembler_return[2])
#### Performance comparison ####
modelresult <- ensembler_return[3]
modelresult
#### Unseen data ###
unseenpreddata <- data.frame(ensembler_return[4])
table(unseenpreddata$unseenpreddata)
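As a quick sanity check, you can compute the test-split accuracy from the returned predictions; a minimal sketch, assuming the returned test data carries the actual_label and predictions columns used throughout this post:

# Proportion of correctly classified test observations
mean(testpreddata$predictions == testpreddata$actual_label)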
- Training the ensemble regression model is likewise a one-line call, to the ensembler.regression function. Again, you can pass either a CSV file read directly or an imported data frame; the arguments are, in order: the dataset, the outcome/response variable index, the base learners, the final learner, the train-validation-test split ratios, and the unseen data.
house_price <- read.csv(file = './data/regression/house_price_data.csv')
unseen_new_data_testing_house_price <- house_price[250:414,]
write.csv(unseen_new_data_testing_house_price, 'unseen_house_price_regression.csv', fileEncoding = 'UTF-8', row.names = F)
ensembler_return <- ensembler.regression(house_price[1:250,], 1, c('treebag','rpart'), 'gbm', 0.60, 0.20, 0.20, read.csv('./unseen_house_price_regression.csv'))
OR
ensembler_return <- ensembler.regression(house_price[1:250,], 1, c('treebag','rpart'), 'gbm', 0.60, 0.20, 0.20, unseen_new_data_testing_house_price)
The above function returns the test data with its predictions, the predicted values, the model results, and finally the unseen data with its predictions.
testpreddata <- data.frame(ensembler_return[1])
#### Performance comparison ####
modelresult <- ensembler_return[3]
modelresult
write.csv(modelresult[[1]], "performance_chart.csv")
#### Unseen data ###
unseenpreddata <- data.frame(ensembler_return[4])
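To quantify fit on the test split, you can compute the RMSE by hand; a minimal sketch, assuming the regression output mirrors the column naming of the classification case (check names(testpreddata) first):

# Root mean squared error on the test split -- column names are an
# assumption; inspect names(testpreddata) to confirm
rmse <- sqrt(mean((testpreddata$predictions - testpreddata$actual_label)^2))
rmse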
Examples
Demo: Classification
library("metaEnsembleR")
data("iris")
attach(iris)
unseen_new_data_testing <- iris[130:150,]
write.csv(unseen_new_data_testing, 'unseen_check.csv', fileEncoding = 'UTF-8', row.names = F)
ensembler_return <- ensembler.classifier(iris[1:130,], 5, c('treebag','rpart'), 'gbm', 0.60, 0.20, 0.20, unseen_new_data_testing)
testpreddata <- data.frame(ensembler_return[1])
table(testpreddata$actual_label)
table(ensembler_return[2])
#### Performance comparison ####
modelresult <- ensembler_return[3]
modelresult
# qplot(), tableGrob(), and grid.arrange() come from ggplot2 and gridExtra
library(ggplot2)
library(gridExtra)
act_mybar <- qplot(testpreddata$actual_label, geom = "bar")
act_mybar
pred_mybar <- qplot(testpreddata$predictions, geom = "bar")
pred_mybar
act_tbl <- tableGrob(t(summary(testpreddata$actual_label)))
pred_tbl <- tableGrob(t(summary(testpreddata$predictions)))
ggsave("testdata_actual_vs_predicted_chart.pdf", grid.arrange(act_tbl, pred_tbl))
ggsave("testdata_actual_vs_predicted_plot.pdf", grid.arrange(act_mybar, pred_mybar))
#### Unseen data ###
unseenpreddata <- data.frame(ensembler_return[4])
table(unseenpreddata$unseenpreddata)
table(unseen_new_data_testing$Species)
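Beyond the raw counts above, a full confusion matrix gives per-class statistics; a minimal sketch using caret (an extra step I am adding here, not part of the package output):

library(caret)
# Align factor levels so predictions and actuals are comparable
lv <- union(levels(factor(testpreddata$predictions)),
            levels(factor(testpreddata$actual_label)))
# Cross-tabulate predictions against actuals with accuracy statistics
confusionMatrix(factor(testpreddata$predictions, levels = lv),
                factor(testpreddata$actual_label, levels = lv))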
Demo: Regression
library("metaEnsembleR")
data("rock")
unseen_rock_data <- rock[30:48,]
ensembler_return <- ensembler.regression(rock[1:30,], 4, c('lm'), 'rf', 0.40, 0.30, 0.30, unseen_rock_data)
testpreddata <- data.frame(ensembler_return[1])
#### Performance comparison ####
modelresult <- ensembler_return[3]
modelresult
write.csv(modelresult[[1]], "performance_chart.csv")
#### Unseen data ###
unseenpreddata <- data.frame(ensembler_return[4])
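To visualize regression quality, an observed-vs-predicted scatter plot is a quick check; a minimal sketch with ggplot2, again assuming the actual_label/predictions column names (inspect names(testpreddata) to confirm):

library(ggplot2)
# Points falling on the dashed identity line are perfect predictions
ggplot(testpreddata, aes(x = actual_label, y = predictions)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Observed", y = "Predicted")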
Comprehensive Examples
More demo examples can be found in the Demo.R file; to reproduce the results, run Rscript Demo.R in the terminal.
Contact
If there is some implementation you would like to see here, or examples you would like to add, feel free to do so. You can always reach me at ajay.arunachalam08@gmail.com.
Always Keep Learning & Sharing Knowledge!!!