DESCRIPTION

v.class.mlR is a wrapper script that uses the R caret package for machine learning to classify features by supervised learning, based on a set of training features.

The user provides a set of objects (or segments) to be classified, including all feature variables describing these objects, and a set of objects to be used as training data, including the same feature variables as those describing the unknown objects, plus one additional column indicating the class each training object falls into. The training data can be, but does not have to be, a subset of the set of objects to be classified.

The user can provide input as vector maps (segments_map and training_map), as csv files (segments_file and training_file), or as a combination of both. Csv files have to be formatted in line with the default output of v.db.select, i.e. with a header. The field separator can be set with the separator parameter. Output can consist of additional columns in the vector input map of features, a text file (classification_results) or reclassed raster maps (classified_map).
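
Before running the module, it can help to confirm that the two csv files agree on their feature columns. The following Python sketch is a hypothetical pre-check and not part of v.class.mlR: the function name check_columns is invented for illustration, the '|' separator is assumed to match the v.db.select default, and in-memory text stands in for the actual files.

```python
import csv
import io

def check_columns(segments_text, training_text, class_column, separator="|"):
    """Return segment columns that are missing from the training data (hypothetical helper)."""
    seg_header = next(csv.reader(io.StringIO(segments_text), delimiter=separator))
    train_header = next(csv.reader(io.StringIO(training_text), delimiter=separator))
    if class_column not in train_header:
        raise ValueError(f"class column {class_column!r} not found in training data")
    train_features = set(train_header) - {class_column}
    # Every column describing the segments must also describe the training objects.
    return [col for col in seg_header if col not in train_features]

# Illustrative data only.
segments = "cat|area|mean_red\n1|120.5|0.31\n2|98.0|0.47\n"
training = "cat|area|mean_red|class\n1|120.5|0.31|forest\n"
missing = check_columns(segments, training, "class")
print(missing)
```

An empty list means that every column of the segments file is also present in the training file.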

The user has to provide the name of the column in the training data that contains the class values (train_class_column), the prefix of the columns that will contain the final class after classification (output_class_column) as well as the prefix of the columns that will contain the probability values linked to these classifications (output_prob_column - see below).

Several classifiers are available: k-nearest neighbor (knn, and knn1 for k=1), support vector machine with a radial kernel (svmRadial), random forest (rf) and recursive partitioning (rpart). Each of these classifiers is tuned automatically through repeated cross-validation. caret automatically determines a reasonable set of values for tuning. See the caret webpage for more information about the tuning parameters for each classifier, and more generally about how caret works. By default, the module creates 10 5-fold partitions for cross-validation and tests 10 possible values of the tuning parameters. These values can be changed using, respectively, the partitions, folds and tunelength parameters.
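
To illustrate what "10 5-fold partitions" means, the following Python sketch generates the index splits of a repeated k-fold cross-validation. It is a simplified stand-in written for this page, not caret's implementation: the function name and the seed are assumptions, and caret's own resampling may differ, for example by stratifying on the class.

```python
import random

def repeated_kfold(n_objects, partitions=10, folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    indices = list(range(n_objects))
    for _ in range(partitions):
        rng.shuffle(indices)  # a new random partition of the training objects
        for f in range(folds):
            test = sorted(indices[f::folds])  # every folds-th object is held out
            held_out = set(test)
            train = sorted(i for i in indices if i not in held_out)
            yield train, test

splits = list(repeated_kfold(23))
print(len(splits))  # 10 partitions x 5 folds = 50 resamples
```

Each candidate set of tuning values is evaluated on all of these resamples, which is why increasing partitions, folds or tunelength increases the run time.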

The user can define a customized tunegrid for each classifier, using the tunegrids parameter. Any customized tunegrid has to be defined as a Python dictionary, with the classifier names as keys and the arguments to expand.grid() as values, as defined in the caret documentation.

For example, to define customized tuning grids for svmRadial and random forest (rf), the user can define the parameter as:

tunegrids="{'svmRadial': 'sigma=c(0.01,0.05,0.1), C=c(1,16,128)', 'rf': 'mtry=c(3,10,20)'}"
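
As a rough illustration of how such a string maps onto R, the following Python sketch parses the dictionary and wraps each value in an expand.grid() call. How the module processes the parameter internally is not described here, so this is an assumption, and the function name tunegrids_to_r is hypothetical.

```python
import ast

def tunegrids_to_r(tunegrids):
    """Turn a tunegrids string into one R expand.grid() call per classifier."""
    grids = ast.literal_eval(tunegrids)  # safely evaluate the dictionary literal
    if not isinstance(grids, dict):
        raise ValueError("tunegrids must be a Python dictionary")
    return {clf: f"expand.grid({args})" for clf, args in grids.items()}

tunegrids = "{'svmRadial': 'sigma=c(0.01,0.05,0.1), C=c(1,16,128)', 'rf': 'mtry=c(3,10,20)'}"
r_calls = tunegrids_to_r(tunegrids)
for classifier, r_call in r_calls.items():
    print(classifier, "->", r_call)
```

Because expand.grid() builds all combinations of its arguments, the svmRadial grid above contains 3 x 3 = 9 candidate parameter sets and the rf grid contains 3.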

The module can run the model tuning using parallel processing. In order for this to work, the R package doParallel has to be installed. The processes parameter allows the user to choose the number of processes to run.

The user can choose to include the individual classifiers' results in the output (the attributes and/or the raster maps) using the i flag, but by default the output will be the result of a voting scheme merging the results of the different classifiers. Votes can be weighted according to a user-defined mode (weighting_mode): simple majority vote without weighting, i.e. all weights are equal (smv), simple weighted majority vote (swv), best-worst weighted vote (bwwv) and quadratic best-worst weighted vote (qbwwv). For more details about these voting modes see Moreno-Seco et al. (2006). By default, the weights are calculated based on the accuracy metric, but the user can choose the kappa value as an alternative (weighting_metric).
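
The following Python sketch shows one way the four weighting modes can be computed from a per-classifier score (accuracy or kappa). It assumes the usual definitions of the best-worst schemes, in which the best classifier gets weight 1 and the worst gets weight 0, and it assumes that swv uses the score itself as the weight. The module's internal formulas may differ, and the scores below are made up for illustration.

```python
def vote_weights(scores, mode="smv"):
    """Map each classifier's score (accuracy or kappa) to a voting weight."""
    best, worst = max(scores.values()), min(scores.values())
    spread = best - worst
    if mode == "smv":
        return {clf: 1.0 for clf in scores}  # unweighted majority vote
    if mode == "swv":
        return dict(scores)  # assumption: the weight is the score itself
    if mode in ("bwwv", "qbwwv"):
        if spread == 0:
            return {clf: 1.0 for clf in scores}  # all classifiers tie
        power = 2 if mode == "qbwwv" else 1
        # Linear (bwwv) or squared (qbwwv) rescaling between worst and best.
        return {clf: ((s - worst) / spread) ** power for clf, s in scores.items()}
    raise ValueError(f"unknown weighting mode: {mode}")

# Made-up scores, for illustration only.
accuracy = {"svmRadial": 0.90, "rf": 0.85, "rpart": 0.70, "knn": 0.80}
for mode in ("smv", "swv", "bwwv", "qbwwv"):
    print(mode, vote_weights(accuracy, mode))
```

In both best-worst modes the weakest classifier effectively loses its vote, and the quadratic variant further reduces the influence of mid-ranked classifiers.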

In the output (as attribute columns or text file) each weighting scheme's result is accompanied by a value that can be considered an estimate of the probability of the classification after the weighted vote, based on equation (2) in Moreno-Seco et al. (2006), page 709. At this stage, however, this estimate does not take into account the probabilities determined individually by each classifier.
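
The equation itself is not reproduced on this page. As a hedged illustration, the following Python sketch assumes that the estimate is the share of the total voting weight that supports the winning class. The function name, the weights and the class labels are hypothetical. In line with the text above, only the votes and the weights enter the calculation, not the per-classifier probabilities.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Return (winning class, share of total weight) for one object."""
    totals = defaultdict(float)
    for clf, label in predictions.items():
        totals[label] += weights[clf]  # each classifier adds its weight to its class
    winner = max(totals, key=totals.get)
    total_weight = sum(totals.values())
    share = totals[winner] / total_weight if total_weight else 0.0
    return winner, share

# Hypothetical votes and weights for a single object.
predictions = {"svmRadial": "forest", "rf": "forest", "rpart": "water", "knn": "water"}
weights = {"svmRadial": 1.0, "rf": 0.75, "rpart": 0.0, "knn": 0.5}
winner, share = weighted_vote(predictions, weights)
print(winner, round(share, 3))
```

In this sketch a tie is resolved in favour of the class that received its first vote earliest, which is an arbitrary choice made for simplicity.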

Optional outputs of the module include detailed information about the different classifier models and their cross-validation results (model_details; for details of these results see the train, resamples and confusionMatrix.train functions in the caret package), a box-and-whisker plot indicating the resampling variance based on the cross-validation for each classifier (bw_plot_file), and a csv file containing accuracy measures (overall accuracy and kappa) for each classifier (accuracy_file). The user can also choose to write the R script constructed and used internally to a text file for study or further modification.

NOTES

The module can be used in a tool chain together with i.segment and the addon i.segment.stats for object-based classification of satellite imagery.

WARNING: The optional output files are created by R, and currently no check is made for existing files of the same name. If such files exist, they are silently overwritten, regardless of whether the GRASS GIS --o flag is set or not.

The module makes no effort to check the input data for NA values or anything else that might perturb the analyses. It is up to the user to perform the relevant checks before launching the module.
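
As an example of such a check, the following Python sketch scans csv input for empty or NA-like values. It is a hypothetical helper and not part of the module: the list of missing-value markers and the '|' separator are assumptions, and in-memory text stands in for an actual file.

```python
import csv
import io

NA_MARKERS = {"", "NA", "NaN", "nan", "NULL"}  # assumed spellings of missing values

def find_missing_values(text, separator="|"):
    """Return (line number, column) pairs where a value looks missing."""
    reader = csv.DictReader(io.StringIO(text), delimiter=separator)
    problems = []
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for column, value in row.items():
            # Non-string values come from rows with too few or too many fields.
            if not isinstance(value, str) or value.strip() in NA_MARKERS:
                problems.append((line_number, column))
    return problems

# Illustrative data only.
sample = "cat|area|mean_red\n1|120.5|0.31\n2||0.47\n3|98.0|NA\n"
problems = find_missing_values(sample)
print(problems)
```

Objects flagged in this way can be removed or corrected before the file is passed to the module.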

DEPENDENCIES

This module uses R. It is the user's responsibility to make sure R is installed and can be called from the environment this module is running in. See the relevant wiki page for more information. The module tries to install missing R packages automatically. These include: 'caret', 'kernlab', 'e1071', 'randomForest', and 'rpart'. Other packages may be needed, such as 'ggplot2' and 'lattice' (for the plots) and 'doParallel' (if parallel processing is desired).

TODO

EXAMPLE

Using existing vector maps as input and writing the output to the attribute table of the segments map, including the individual classifier results:

v.class.mlR segments_map=seg training_map=training train_class_column=class weighting_mode=smv,swv,qbwwv -i

Using text files with segment characteristics as input and writing output to raster files and a csv file:

v.class.mlR segments_file=segstats.csv training_file=training.csv train_class_column=class weighting_mode=smv,swv,qbwwv raster_segments_map=seg classified_map=vote classification_results=class_results.csv

REFERENCES

Moreno-Seco, F. et al. (2006), Comparison of Classifier Fusion Methods for Classification in Pattern Recognition Tasks. In D.-Y. Yeung et al., eds. Structural, Syntactic, and Statistical Pattern Recognition. Lecture Notes in Computer Science. Springer Berlin Heidelberg, pp. 705–713, http://dx.doi.org/10.1007/11815921_77.

SEE ALSO

i.segment, r.object.activelearning, r.learn.ml

AUTHOR

Moritz Lennert, Université Libre de Bruxelles (ULB) based on an initial R-script by Ruben Van De Kerchove, also ULB at the time
