This document describes how to use supervised classification to predict mislabeling of samples. Accidental mislabeling of samples due to human error, although rare, is nonetheless a real problem in large-scale studies. Supervised learning can be used to mitigate mislabeling of samples, as described previously (1). There are full instructions for running supervised learning in the Running Supervised Learning tutorial. If you suspect mislabeling in your data, you can use the following procedure to predict and remove mislabeled samples at varying levels of confidence.
The supervised_learning.py script requires a QIIME OTU table (or equivalent) and a QIIME metadata mapping file.
To run supervised classification on the QIIME tutorial data set, where the “Treatment” metadata column gives the class labels:
supervised_learning.py -i otu_table.biom -m Fasting_Map.txt -c Treatment -o ml -v
One of the output files from supervised_learning.py is mislabeling.txt. It contains columns named “mislabeled_probability_above_0.xx” for thresholds 5%, 10%, ..., 95%, 99%, each indicating whether the estimated probability that a given sample is mislabeled exceeds that threshold. For example, if a sample has the value “TRUE” in the column “mislabeled_probability_above_0.95”, then the model has estimated that the probability that the sample is mislabeled is greater than 95%. To remove from your data table the samples whose estimated probability of being mislabeled is greater than 95%, you would run:
filter_samples_from_otu_table.py -i otu_table.biom -m ml/mislabeling.txt -s 'mislabeled_probability_above_0.95:FALSE' -o otu_table_no_mislabeled.biom
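Before filtering, it can be useful to check how many samples would be discarded at a given threshold. The following is a minimal sketch, not part of QIIME: it assumes that mislabeling.txt is tab-separated and that its first line is a header naming the “mislabeled_probability_above_0.xx” columns. The helper count_flagged and the three-sample file it builds are hypothetical and only illustrate the idea; in practice you would point the helper at ml/mislabeling.txt.

```shell
# Hypothetical stand-in for ml/mislabeling.txt; the real file may hold more columns.
workdir=$(mktemp -d)
example="$workdir/mislabeling.txt"
printf '#SampleID\tmislabeled_probability_above_0.50\tmislabeled_probability_above_0.95\n' > "$example"
printf 'S1\tFALSE\tFALSE\n' >> "$example"
printf 'S2\tTRUE\tFALSE\n' >> "$example"
printf 'S3\tTRUE\tTRUE\n' >> "$example"

# count_flagged FILE COLUMN
# Prints the number of rows whose value in the named column is TRUE.
# Assumes the first line of FILE is a tab-separated header.
count_flagged() {
    awk -F'\t' -v col="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
        c && $c == "TRUE" { n++ }
        END { print n + 0 }
    ' "$1"
}

flagged_50=$(count_flagged "$example" mislabeled_probability_above_0.50)
flagged_95=$(count_flagged "$example" mislabeled_probability_above_0.95)
echo "flagged above 0.50: $flagged_50"
echo "flagged above 0.95: $flagged_95"
```

Comparing the counts across thresholds gives a rough sense of how aggressive each cutoff is before any samples are removed from the OTU table.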
You can also visualize the predicted mislabels using make_emperor.py. Assuming that you have run beta_diversity.py and principal_coordinates.py to obtain a principal coordinates table pcoa.txt for your samples, you can use the following command to obtain a plot where samples are colored by their mislabeling status:
make_emperor.py -i pcoa.txt -m ml/mislabeling.txt -o color_by_mislabeling
Predicting mislabeled samples is challenging because (a) we don’t know the proportion of mislabeled samples ahead of time (in fact it is often zero); (b) we have to be able to distinguish the different types of labels with high accuracy; and (c) we have to train a model to do (b) even when some of the training samples may be mislabeled. Therefore we recommend applying this approach only to data sets with a small number of well-characterized and distinguishable classes. We have found the Random Forests classifier to be robust to noisy (i.e. mislabeled) training data in several data sets (1), but we still recommend that you exercise caution when applying this technique. Here are some important steps that you can take to decrease the likelihood that you will be discarding correctly labeled samples:
Note: we recommend running single_rarefaction.py on your OTU table before using it as input to supervised_learning.py, to control for variation in sequencing effort.
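As a rough sketch of that workflow, the commands below rarefy the tutorial OTU table and then pass the rarefied table to supervised_learning.py. The depth of 100 sequences per sample and the output file name are assumptions chosen for illustration; pick a depth appropriate to your own data. The guard simply skips the commands when the QIIME scripts or the input table are not available.

```shell
depth=100                                # hypothetical rarefaction depth
rarefied="otu_table_even${depth}.biom"   # hypothetical output file name

if command -v single_rarefaction.py >/dev/null 2>&1 && [ -f otu_table.biom ]; then
    # Subsample every sample to the same number of sequences.
    single_rarefaction.py -i otu_table.biom -o "$rarefied" -d "$depth"
    # Train the classifier on the rarefied table instead of the raw one.
    supervised_learning.py -i "$rarefied" -m Fasting_Map.txt -c Treatment -o ml -v
else
    echo "QIIME scripts or otu_table.biom not found; skipping" >&2
fi
```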