sampledoc
News and Announcements »

Predicting Mislabeled Samples

This document describes how to use supervised classification to predict mislabeling of samples. Accidental mislabeling of samples due to human error, although rare, is nonetheless a real problem in large-scale studies. Supervised learning can be used to mitigate mislabeling of samples, as described previously (1). There are full instructions for running supervised learning in the Running Supervised Learning tutorial. If you suspect mislabeling in your data, you can use the following procedure to predict and remove mislabeled samples at varying levels of confidence.

Input files

This script requires a QIIME OTU table (or equivalent) and a QIIME metadata mapping file.

Estimating mislabeling probabilities

To run supervised classification on the QIIME tutorial data set, where the “Treatment” metadata column gives the class labels:

supervised_learning.py -i otu_table.biom -m Fasting_Map.txt -c Treatment -o ml -v

Removing mislabeled samples

One of the output files from supervised_learning.py is the file mislabeling.txt. This contains columns “mislabeled_probability_above_0.xx” for thresholds 5%, 10%, ..., 95%, 99%, indicating whether the probability that a given sample is mislabeled exceeds the given threshold. For example, if a particular sample has the value “TRUE” in the column “mislabeled_probability_above_0.95”, then the model has estimated that there is a 95% chance that the sample is mislabeled. Then, to remove samples from your data table that are predicted to have at least a 95% chance of being mislabeled, you would run:

filter_samples_from_otu_table.py -i otu_table.biom -m ml/mislabeling.txt -s 'mislabeled_probability_above_0.95:FALSE' -o otu_table_no_mislabeled.biom

Visualizing mislabeled samples

You can also visualize the predicted mislabels using make_3d_plots.py. Assuming that you have run beta_diversity.py and principal_coordinates.py to obtain a principal coordinates table pcoa.txt for your samples, you can use the following command to obtain a plot where samples are colored by their mislabeling status at a 90% threshold (i.e. 90% estimated probability of being mislabeled):

make_3d_plots.py -i pcoa.txt -m ml/mislabeling.txt -o color_by_mislabeling -b mislabeled_probability_above_0.90

The following command will make separate plots for all mislabeling thresholds:

make_3d_plots.py -i pcoa.txt -m mislabeling.txt -o color_by_mislabeling -b mislabeled_probability_above_0.05,mislabeled_probability_above_0.10,mislabeled_probability_above_0.15,mislabeled_probability_above_0.20,mislabeled_probability_above_0.25,mislabeled_probability_above_0.30,mislabeled_probability_above_0.35,mislabeled_probability_above_0.40,mislabeled_probability_above_0.45,mislabeled_probability_above_0.50,mislabeled_probability_above_0.55,mislabeled_probability_above_0.60,mislabeled_probability_above_0.65,mislabeled_probability_above_0.70,mislabeled_probability_above_0.75,mislabeled_probability_above_0.80,mislabeled_probability_above_0.85,mislabeled_probability_above_0.90,mislabeled_probability_above_0.95,mislabeled_probability_above_0.99

Cautions

Predicting mislabeled samples is challenging because (a) we don’t know the proportion of mislabeled samples ahead of time (in fact it is often zero); (b) we have to be able to distinguish the different types of labels with high accuracy; and (c) we have train a model to do (b) even when some of the training samples may be mislabeled. Therefore we recommend applying this approach only to data sets with a small number of well-characterized and distinguishable classes. We have found the Random Forests classifier to be robust to noisy (i.e. mislabeled) training data in several data sets (1), but we still recommend that you exercise caution when applying this technique. Here are some important steps that you can take to decrease the likelihood that you will be discarding correctly labeled samples:

  • Read the tutorial on running supervised learning.
  • After running supervised_learning.py, examine the output file summary.txt. If your estimated error is high (e.g. above 5% or 10%) this is an indication that your classes are not sufficiently distinct from one another to detect mislabeling, and you should not use the predictions of the model to remove mislabeled samples. In general, the more classes you have, and the more similar they are to one another, the more samples the model requires to be able to distinguish them. For example, you might be able to get good results with only 50 samples when you have two very distinct classes (e.g. gut microbiome versus skin microbiome), but you may need hundreds or thousands of samples to build an accurate predictive model if you have many very similar classes (e.g. distinguishing the gut microbiomes of a dozen individuals). The estimated error rate in summary.txt indicates how well the model was able to distinguish the classes in your data.
  • Unless you are certain that some of your data are mislabeled, you may want to use a very conservative threshold for identifying mislabeled samples. For example, you might choose to discard samples only if the model predicts their probability of being mislabeled to be 95% or higher.
  • This approach should be used to flag potentially mislabeled samples for further investigation. Whenever possible, return to the primary data source to verify and correct mislabeling.

Note: we recommend running single_rarefaction.py on your OTU table before using it as input to supervised_learning.py, to control for variation in sequencing effort.

References

[1]Knights, D., et al. (2010). Supervised classification of microbiota mitigates mislabeling errors. The ISME journal, 5(4), 570-573. (link)

sampledoc