This tutorial covers how to retrain the RDP Classifier so that it can be used with an alternate, arbitrary taxonomy. This is useful, for example, to assign greengenes taxonomy strings to your sequences, or to assign taxonomy to eukaryotic sequences using the Silva database.
The example in this tutorial re-assigns taxonomy to the QIIME Overview Tutorial data using the greengenes taxonomy. It assumes that you’ve already run the QIIME Overview Tutorial, and that you’re working in the directory where you ran the tutorial commands. You’ll also need the greengenes reference OTUs; obtaining them is covered in the first step.
The most recent version of the greengenes OTUs is always available from the QIIME Resources page (click the Resources link on the left side of the QIIME homepage). As of this writing that is the 4feb2011 version, so we’ll illustrate commands working with that.
Download and unzip the greengenes reference OTUs:
wget http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/Caporaso_Reference_OTUs/gg_otus_4feb2011.tgz
tar -xvzf gg_otus_4feb2011.tgz
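Before retraining, it can help to confirm that the archive unpacked into the layout the later commands expect. The following is a minimal Python sketch, not part of QIIME: the helper name missing_gg_files is hypothetical, and the two relative paths are simply the ones used by the commands in this tutorial.

```python
import os

# Relative paths used by the commands in this tutorial (4feb2011 release).
EXPECTED_FILES = (
    os.path.join("taxonomies", "greengenes_tax_rdp_train.txt"),
    os.path.join("rep_set", "gg_97_otus_4feb2011.fasta"),
)


def missing_gg_files(gg_dir="gg_otus_4feb2011"):
    """Return the expected files that are absent under gg_dir (hypothetical helper)."""
    return [rel for rel in EXPECTED_FILES
            if not os.path.isfile(os.path.join(gg_dir, rel))]
```

An empty list from missing_gg_files() means both files were found in the unpacked directory; any path it returns should be checked before moving on.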
Next you’ll retrain the RDP classifier and classify your sequences. You can use either assign_taxonomy.py or parallel_assign_taxonomy_rdp.py for this; the command below uses assign_taxonomy.py:
assign_taxonomy.py -i otus/rep_set/seqs_rep_set.fasta -t gg_otus_4feb2011/taxonomies/greengenes_tax_rdp_train.txt -r gg_otus_4feb2011/rep_set/gg_97_otus_4feb2011.fasta -o otus/rdp_assigned_taxonomy_gg/ -m rdp
Next, you’ll rebuild the OTU table with the new taxonomic information.
make_otu_table.py -i otus/uclust_picked_otus/seqs_otus.txt -t otus/rdp_assigned_taxonomy_gg/seqs_rep_set_tax_assignments.txt -o otus/otu_table_gg.biom
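If you prefer to drive the two commands above from a script, they can be sketched in Python as argument lists. This is an illustrative sketch, not a QIIME feature: the helper names taxonomy_commands and run_all are hypothetical, and the only options used are the ones that appear in the commands in this tutorial.

```python
import subprocess


def taxonomy_commands(gg_dir="gg_otus_4feb2011", otus_dir="otus"):
    """Build the two tutorial commands as argument lists (hypothetical helper)."""
    assigned_dir = otus_dir + "/rdp_assigned_taxonomy_gg/"
    assign = [
        "assign_taxonomy.py",
        "-i", otus_dir + "/rep_set/seqs_rep_set.fasta",
        "-t", gg_dir + "/taxonomies/greengenes_tax_rdp_train.txt",
        "-r", gg_dir + "/rep_set/gg_97_otus_4feb2011.fasta",
        "-o", assigned_dir,
        "-m", "rdp",
    ]
    make_table = [
        "make_otu_table.py",
        "-i", otus_dir + "/uclust_picked_otus/seqs_otus.txt",
        "-t", assigned_dir + "seqs_rep_set_tax_assignments.txt",
        "-o", otus_dir + "/otu_table_gg.biom",
    ]
    return [assign, make_table]


def run_all(commands):
    """Run each command in order; raises CalledProcessError on the first failure."""
    for cmd in commands:
        subprocess.check_call(cmd)
```

Calling run_all(taxonomy_commands()) from the tutorial directory would run both steps in sequence. It assumes the QIIME scripts are on your PATH.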
That’s it. The resulting OTU table (otu_table_gg.biom) can now be used in downstream analyses, such as summarize_taxa_through_plots.py.
If you want to integrate retraining of the RDP classifier into your QIIME workflows, you can create a custom parameters file that can be used with the pick_de_novo_otus.py workflow script. If the gg_otus_4feb2011 directory is in $HOME/, the values in your custom parameters file would be:
assign_taxonomy:reference_seqs_fp $HOME/gg_otus_4feb2011/rep_set/gg_97_otus_4feb2011.fasta
assign_taxonomy:id_to_taxonomy_fp $HOME/gg_otus_4feb2011/taxonomies/greengenes_tax_rdp_train.txt
assign_taxonomy:assignment_method rdp
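The three parameter lines above can also be generated for whatever directory holds the greengenes OTUs. The following Python sketch is one way to do that. The helper names rdp_params_lines and write_params are hypothetical, and the output file name is up to you.

```python
import os


def rdp_params_lines(gg_dir):
    """Return the three assign_taxonomy parameter lines for gg_dir (hypothetical helper)."""
    ref = os.path.join(gg_dir, "rep_set", "gg_97_otus_4feb2011.fasta")
    tax = os.path.join(gg_dir, "taxonomies", "greengenes_tax_rdp_train.txt")
    return [
        "assign_taxonomy:reference_seqs_fp " + ref,
        "assign_taxonomy:id_to_taxonomy_fp " + tax,
        "assign_taxonomy:assignment_method rdp",
    ]


def write_params(path, gg_dir):
    """Write the parameter lines to path, one per line."""
    with open(path, "w") as handle:
        handle.write("\n".join(rdp_params_lines(gg_dir)) + "\n")
```

For example, write_params("rdp_gg_params.txt", os.path.join(os.path.expanduser("~"), "gg_otus_4feb2011")) would write the lines with your home directory already expanded into an absolute path. The file name rdp_gg_params.txt is only an illustration.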
You can define your own training files for other taxonomies. The format is the same as the id_to_taxonomy_map used by the BLAST taxonomy assigner. You must provide this file as well as a fasta file of reference sequences whose identifiers correspond to the ids in the id_to_taxonomy_map.
The RDP Classifier has several requirements for the taxonomy strings used in retraining. The first column of this tab-separated file contains the sequence identifiers (see the reference sequence file below). The second column contains the taxonomy strings, ordered from the broadest taxonomic level to the most specific and separated by semicolons. The number of taxonomic levels must be equal for every line.
An example set of four lines from the 4feb2011 greengenes OTUs that is valid for RDP retraining:
573145 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Escherichia;s__
89440 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Escherichia;s__
222043 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Raoultella;s__Raoultellaornithinolytica
430240 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Serratia;s__Serratiamarcescens
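The two format requirements described above (two tab-separated columns, and the same number of semicolon-separated levels on every line) are easy to check before retraining. The following is a minimal Python sketch of such a check. The helper name check_taxonomy_map is hypothetical, and it tests only these two requirements, not everything the RDP Classifier may require.

```python
def check_taxonomy_map(lines):
    """Return a list of problems found in id-to-taxonomy lines (hypothetical helper).

    Each line is expected to be '<sequence id><TAB><level;level;...>'.
    """
    problems = []
    expected_levels = None
    for number, line in enumerate(lines, 1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append("line %d: expected 2 tab-separated columns" % number)
            continue
        depth = len(fields[1].split(";"))
        if expected_levels is None:
            expected_levels = depth  # the first valid line sets the reference depth
        elif depth != expected_levels:
            problems.append("line %d: %d levels, expected %d"
                            % (number, depth, expected_levels))
    return problems
```

To check a file, pass an open file handle, for example check_taxonomy_map(open("greengenes_tax_rdp_train.txt")). An empty list means no problems were found by these two checks.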
The reference sequence file should have fasta labels that match all of the identifiers listed in the taxonomy mapping file. The orientation of the sequences does not matter for RDP, but it can affect other software such as uclust, so it is suggested that all sequences be in the same orientation to avoid complications with other QIIME scripts.
An example fasta file (with truncated nucleotide sequences) that matches the above taxonomy strings is:
>573145
AGAGTTTGATCATGGCTCAGATTGAACGCAGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGCAGCT
>89440
AGAGTTTGATCCTGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGC
>222043
AGAGTTTGATCCTGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGCACAGAAAGCTTACTC
>101567
TGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGAGGAGGAAGGCATTAAGGTTAATAACCTTAGTGATTGA
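Because the fasta labels must correspond to the identifiers in the taxonomy mapping file, it is worth comparing the two sets before retraining. The following Python sketch does that. The helper names are hypothetical, and it assumes the identifier is the first word after ">" in the fasta file and the first tab-separated column in the taxonomy file.

```python
def fasta_ids(lines):
    """Collect the identifier (first word after '>') of each fasta record."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])
    return ids


def taxonomy_ids(lines):
    """Collect the first tab-separated column of each non-blank taxonomy line."""
    return {line.split("\t")[0].strip() for line in lines if line.strip()}


def compare_ids(fasta_lines, taxonomy_lines):
    """Return (ids lacking a sequence, ids lacking a taxonomy line), both sorted."""
    in_fasta = fasta_ids(fasta_lines)
    in_tax = taxonomy_ids(taxonomy_lines)
    return sorted(in_tax - in_fasta), sorted(in_fasta - in_tax)
```

Two empty lists mean every identifier appears in both files. Applied to the example lines and labels shown in this section, such a check would report 430240 as lacking a sequence and 101567 as lacking a taxonomy line, which is the kind of mismatch worth catching before retraining.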