This document describes how to run supervised classification with QIIME. The goal of supervised classification is to classify new, unlabeled communities based on a set of labeled training communities. See (1) for a general discussion of the application of supervised classification to microbiota. Supervised classification using the Random Forests (2) classifier is implemented in the QIIME script supervised_learning.py. Running this script produces several output files.
This script requires a QIIME OTU table (or equivalent) and a QIIME metadata mapping file.
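As a rough illustration of how class labels come out of a mapping file, the sketch below reads one metadata column from tab-delimited text whose header line begins with #SampleID. The sample IDs, barcodes, and descriptions are invented for the example, and read_class_labels is a hypothetical helper, not part of QIIME.

```python
import csv
import io

# Illustrative mapping-file content (tab-delimited). The rows are made up
# for this sketch and are not taken from the tutorial's Fasting_Map.txt.
EXAMPLE_MAPPING = (
    "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tTreatment\tDescription\n"
    "S1\tAGCACGAGCCTA\tYATGCTGCCTCCCGTAGGAGT\tControl\tcontrol_mouse_1\n"
    "S2\tAACTCGTCGATG\tYATGCTGCCTCCCGTAGGAGT\tControl\tcontrol_mouse_2\n"
    "S3\tACAGACCACTCA\tYATGCTGCCTCCCGTAGGAGT\tFast\tfasting_mouse_1\n"
)


def read_class_labels(mapping_text, column):
    """Return {sample_id: label} for one metadata column (hypothetical helper)."""
    reader = csv.reader(io.StringIO(mapping_text), delimiter="\t")
    header = next(reader)
    header[0] = header[0].lstrip("#")  # the header line starts with '#SampleID'
    if column not in header:
        raise ValueError("column %r not found in mapping file" % column)
    col = header.index(column)
    labels = {}
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        labels[row[0]] = row[col]
    return labels


labels = read_class_labels(EXAMPLE_MAPPING, "Treatment")
print(labels)
```

Each sample ID maps to one class label; these labels are what the classifier is trained to predict from the OTU table.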
To run supervised classification on the QIIME tutorial data set, where the “Treatment” metadata column gives the class labels:
supervised_learning.py -i otu_table.biom -m Fasting_Map.txt -c Treatment -o ml -v
All of the result files are written to the folder ml. The -v flag causes verbose output, including a trace of the classifier’s progress while it is running. This runs Random Forests with the default setting of 500 trees. For larger data sets, the expected generalization error may decrease slightly if more trees are used. You can run random forests with 1,000 trees with the following:
supervised_learning.py -i otu_table.biom -m Fasting_Map.txt -c Treatment -o ml -v --ntree 1000
Both of these examples build a single random forests classifier and use “out-of-bag” predictions (that is, each tree in the forest makes predictions about samples that were absent from its bootstrapped set of samples) to estimate the generalization error. If you have a very small data set you may wish to perform leave-one-out cross validation, in which the class label for each sample is predicted using a separate random forests classifier trained on the other n-1 samples:
supervised_learning.py -i otu_table.biom -m Fasting_Map.txt -c Treatment -o ml -v -e loo
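The two error-estimation schemes described above differ only in which samples are held out. The following is a minimal sketch of that bookkeeping in plain Python, not the code QIIME runs; the function names and the sample count of 9 are assumptions made for illustration. For large n, roughly 37% of samples are expected to be out-of-bag for any one tree.

```python
import random


def bootstrap_and_oob(n_samples, rng):
    """Draw one bootstrap sample (with replacement).

    Returns (in_bag, out_of_bag) lists of sample indices. The out-of-bag
    samples are the ones a tree grown on in_bag never saw.
    """
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


def leave_one_out_splits(n_samples):
    """Yield (train_indices, held_out_index), one pair per sample."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, held_out


rng = random.Random(0)  # fixed seed so the sketch is repeatable
in_bag, oob = bootstrap_and_oob(9, rng)
loo = list(leave_one_out_splits(9))

print("in-bag indices:     ", sorted(in_bag))
print("out-of-bag indices: ", oob)
print("leave-one-out folds:", len(loo))
```

With out-of-bag estimation a single forest is trained and each tree is evaluated on its own unseen samples; with leave-one-out, n separate classifiers are trained, which is why it is practical only for very small data sets.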
To obtain more robust estimates of the generalization error and feature importances (including standard deviations), you can run the script with 5-fold or 10-fold cross validation:
supervised_learning.py -i otu_table.biom -m Fasting_Map.txt -c Treatment -o ml -v -e cv5
or
supervised_learning.py -i otu_table.biom -m Fasting_Map.txt -c Treatment -o ml -v -e cv10
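To show where the standard deviations come from, here is a small sketch of k-fold splitting followed by a mean and standard deviation over per-fold errors. It is an illustration under assumptions: the fold assignment shown is a simple shuffled split, which may differ from what supervised_learning.py does internally, and the per-fold error values are made-up numbers.

```python
import random
import statistics


def k_fold_splits(n_samples, k, rng):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # deal indices out round-robin
    for test in folds:
        held = set(test)
        train = [i for i in indices if i not in held]
        yield train, test


def summarize(fold_errors):
    """Return (mean, sample standard deviation) of per-fold error rates."""
    return statistics.mean(fold_errors), statistics.stdev(fold_errors)


splits = list(k_fold_splits(20, 5, random.Random(0)))

# Hypothetical error rates, one per fold, standing in for real classifier output.
example_fold_errors = [0.10, 0.20, 0.10, 0.30, 0.20]
mean_err, sd_err = summarize(example_fold_errors)

print("folds:", len(splits))
print("mean error: %.3f, standard deviation: %.3f" % (mean_err, sd_err))
```

Because every fold yields its own error estimate, the spread across folds gives a sense of how stable the estimate is, which a single out-of-bag run does not provide.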
Supervised classification is most useful for larger data sets. When data sets are too small, the estimates of the generalization error, feature importance, and class probabilities may be quite variable. How large a data set needs to be depends on, among other things, how subtle the differences between classes are and how many noisy features (e.g., OTUs) there are.
Note: we recommend running single_rarefaction.py on your OTU table before using it as input to supervised_learning.py, to control for variation in sequencing effort.
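The idea behind rarefaction can be sketched as random subsampling of each sample's sequences, without replacement, down to a common depth. The code below is a simplified illustration and not the single_rarefaction.py implementation; the rarefy function, the OTU names, and the counts are all hypothetical.

```python
import random


def rarefy(otu_counts, depth, rng):
    """Subsample one sample's OTU counts to exactly `depth` sequences.

    otu_counts maps OTU id -> observed count. Returns a new dict of counts,
    or None if the sample has fewer than `depth` sequences (an assumption of
    this sketch: such a sample cannot be rarefied to that depth).
    """
    total = sum(otu_counts.values())
    if total < depth:
        return None
    # Expand counts into one entry per sequence, then draw without replacement.
    pool = [otu for otu, count in otu_counts.items() for _ in range(count)]
    picked = rng.sample(pool, depth)
    rarefied = {}
    for otu in picked:
        rarefied[otu] = rarefied.get(otu, 0) + 1
    return rarefied


sample = {"OTU_1": 50, "OTU_2": 30, "OTU_3": 20}  # made-up counts
even = rarefy(sample, 40, random.Random(1))
too_shallow = rarefy(sample, 200, random.Random(1))

print("rarefied counts:", even)
print("total sequences:", sum(even.values()))
```

After this step every retained sample has the same number of sequences, so differences between samples reflect community composition rather than how deeply each one happened to be sequenced.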