Description:
The OTU picking step assigns similar sequences to operational taxonomic units, or OTUs, by clustering sequences based on a user-defined similarity threshold. Sequences which are similar at or above the threshold level are taken to represent the presence of a taxonomic unit (e.g., a genus, when the similarity threshold is set at 0.94) in the sequence collection.
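QIIME currently implements several clustering methods; those demonstrated in the examples below are uclust (the default), uclust_ref, cdhit, blast, prefix_suffix, mothur, and usearch.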
The primary input for pick_otus.py is a standard FASTA file.
Usage: pick_otus.py [options]
Input Arguments:
[REQUIRED]
[OPTIONAL]
Output:
The output consists of two files (e.g., seqs_otus.txt and seqs_otus.log). The .txt file consists of tab-delimited lines, where the first field on each line is an (arbitrary) cluster identifier and the remaining fields are the identifiers of the sequences assigned to that cluster. Sequence identifiers correspond to those provided in the input FASTA file. The usearch method (usearch quality filter) can additionally produce a log file for each intermediate call to usearch.
Example lines from the resulting .txt file:
0	seq1	seq5
1	seq2
2	seq3
3	seq4	seq6	seq7
This result implies that four clusters were created from seven input sequences. The first cluster (cluster id 0) contains two sequences, seq1 and seq5; the second cluster (cluster id 1) contains one sequence, seq2; the third cluster (cluster id 2) contains one sequence, seq3; and the final cluster (cluster id 3) contains three sequences, seq4, seq6, and seq7.
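Because the .txt file is plain tab-delimited text, it is easy to load into other tools. The following is a minimal Python sketch and is not part of QIIME; the function name parse_otu_map is hypothetical, and the example lines are re-typed here with tab separators so the snippet runs without any input file.

```python
from typing import Dict, Iterable, List


def parse_otu_map(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each cluster identifier to the sequence identifiers assigned to it."""
    otus: Dict[str, List[str]] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        # First field is the cluster id; the rest are sequence ids.
        otus[fields[0]] = [field for field in fields[1:] if field]
    return otus


# The four example lines, written with explicit tab separators.
example_lines = [
    "0\tseq1\tseq5\n",
    "1\tseq2\n",
    "2\tseq3\n",
    "3\tseq4\tseq6\tseq7\n",
]

otu_map = parse_otu_map(example_lines)
for otu_id, seq_ids in otu_map.items():
    print(otu_id, len(seq_ids), ", ".join(seq_ids))
```

To read a real result, the same function could be given an open file handle, for example `parse_otu_map(open("picked_otus_default/seqs_otus.txt"))`, assuming that is where the output was written.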
The resulting .log file contains a list of parameters passed to the pick_otus.py script along with the output location of the resulting .txt file.
Example (uclust method, default):
Using the seqs.fna file generated from split_libraries.py and outputting the results to the directory “picked_otus_default/”, while using default parameters (0.97 sequence similarity, no reverse strand matching):
pick_otus.py -i seqs.fna -o picked_otus_default
To change the percent identity to a lower value, such as 90%, and also enable reverse strand matching, the command would be the following:
pick_otus.py -i seqs.fna -o picked_otus_90_percent_rev/ -s 0.90 -z
Uclust Reference-based OTU picking example:
Passing uclust_ref via -m picks OTUs against a reference set. Sequences within the similarity threshold of a reference sequence cluster to an OTU defined by that reference sequence; sequences outside the threshold of every reference sequence form new clusters. OTU identifiers are set to the reference sequence identifier when sequences cluster to a reference sequence, and to ‘qiime_otu_<integer>’ for new OTUs. Creation of new clusters can be suppressed by passing -C, in which case sequences outside the similarity threshold of every reference sequence are listed as failures in the log file and are not included in any OTU.
pick_otus.py -i seqs.fna -r refseqs.fasta -m uclust_ref --uclust_otu_id_prefix qiime_otu_
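The assignment rules described above (reference hit, new cluster, or failure) can be illustrated with a toy Python sketch. This is not uclust and not QIIME code: the identity function is a deliberately naive position-by-position comparison, the function and parameter names are hypothetical, and numbering new OTUs from 0 is an assumption made for the example.

```python
def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions over the shorter length.
    Assumption for illustration only; uclust computes identity differently."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def pick_otus_against_reference(seqs, refs, threshold=0.97,
                                new_otu_prefix="qiime_otu_",
                                suppress_new_clusters=False):
    """seqs and refs map identifier -> sequence. Returns (otus, failures)."""
    otus = {}
    failures = []
    seeds = dict(refs)  # cluster seeds: reference sequences, plus any new seeds
    new_count = 0
    for seq_id, seq in seqs.items():
        # Find the most similar seed seen so far.
        best_id, best_score = None, 0.0
        for seed_id, seed_seq in seeds.items():
            score = identity(seq, seed_seq)
            if score > best_score:
                best_id, best_score = seed_id, score
        if best_id is not None and best_score >= threshold:
            otus.setdefault(best_id, []).append(seq_id)
        elif suppress_new_clusters:
            failures.append(seq_id)  # analogous to passing -C
        else:
            new_id = f"{new_otu_prefix}{new_count}"
            new_count += 1
            seeds[new_id] = seq  # later sequences may join this new cluster
            otus[new_id] = [seq_id]
    return otus, failures


refs = {"ref1": "ACGTACGTAC"}
seqs = {"s1": "ACGTACGTAC", "s2": "TTTTTTTTTT", "s3": "TTTTTTTTTT"}

print(pick_otus_against_reference(seqs, refs, threshold=0.9))
print(pick_otus_against_reference(seqs, refs, threshold=0.9,
                                  suppress_new_clusters=True))
```

In the first call s2 seeds a new OTU that s3 then joins; in the second call both are reported as failures and only the reference-defined OTU remains.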
Example (cdhit method):
Using the seqs.fna file generated from split_libraries.py and outputting the results to the directory “cdhit_picked_otus/”, while using default parameters (0.97 sequence similarity, no prefix filtering):
pick_otus.py -i seqs.fna -m cdhit -o cdhit_picked_otus/
The cd-hit OTU picker currently allows users to perform a pre-filtering step, so that highly similar sequences are clustered prior to OTU picking. This works by collapsing sequences which begin with an identical n-base prefix, where n is specified by the -n parameter. A commonly used value is 100 (e.g., -n 100). With -n 100, all sequences which are identical in their first 100 bases are clustered together, and only one representative sequence from each cluster is passed to cd-hit. This can greatly decrease the run-time of cd-hit-based OTU picking when working with very large sequence collections, as shown by the following command:
pick_otus.py -i seqs.fna -m cdhit -o cdhit_picked_otus_filter/ -n 100
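The idea behind prefix pre-filtering can be sketched in a few lines of Python. This is an illustrative sketch rather than QIIME's implementation: the function names are hypothetical, the first sequence seen in each group is arbitrarily taken as its representative, and a sequence shorter than the prefix length is keyed by its full length.

```python
def collapse_by_prefix(seqs, prefix_length=100):
    """Group sequence ids whose first prefix_length bases are identical.
    Returns representative id -> member ids (representative included)."""
    groups = {}
    for seq_id, seq in seqs.items():
        groups.setdefault(seq[:prefix_length], []).append(seq_id)
    return {members[0]: members for members in groups.values()}


def expand_otus(rep_otus, prefix_groups):
    """Replace each representative in an OTU by all sequences it stands for."""
    return {
        otu_id: [member for rep in reps for member in prefix_groups[rep]]
        for otu_id, reps in rep_otus.items()
    }


# Tiny example with a 4-base prefix instead of 100.
seqs = {"a": "ACGTAA", "b": "ACGTCC", "c": "TTTTAA"}
prefix_groups = collapse_by_prefix(seqs, prefix_length=4)
print(prefix_groups)  # only "a" and "c" would be passed to the slower picker

# Suppose the slower picker put both representatives in one OTU.
rep_otus = {"0": ["a", "c"]}
print(expand_otus(rep_otus, prefix_groups))
```

The saving comes from the slower clustering step seeing only one sequence per prefix group; the final OTU map is then recovered by expanding each representative back into its members.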
Alternatively, to collapse identical sequences, or sequences which are subsequences of other sequences, prior to OTU picking, use the trie prefiltering (“-t”) option, as shown by the command below.
Note: It is highly recommended to use one of the prefiltering methods when analyzing large datasets (>100,000 seqs) to reduce run-time.
pick_otus.py -i seqs.fna -m cdhit -o cdhit_picked_otus_trie_prefilter/ -t
BLAST OTU-Picking Example:
OTUs can be picked against a reference database using the BLAST OTU picker. This is useful, for example, when different regions of the SSU rRNA have been sequenced and a sequence-similarity-based approach like cd-hit therefore wouldn’t work. When using the BLAST OTU picking method, the user must supply either a reference set of sequences or a reference database to compare against. The OTU identifiers resulting from this step will be the sequence identifiers in the reference database. This allows for use of a pre-existing tree in downstream analyses, which again is useful in cases where different regions of the 16S gene have been sequenced.
The following command can be used to blast against a reference sequence set, using the default E-value and sequence similarity (0.97) parameters:
pick_otus.py -i seqs.fna -o blast_picked_otus/ -m blast -r refseqs.fasta
If you already have a pre-built BLAST database, you can pass the database prefix as shown by the following command:
pick_otus.py -i seqs.fna -o blast_picked_otus_prebuilt_db/ -m blast -b refseqs.fasta
If the user would like to change the sequence similarity (“-s”) and/or the E-value (“-e”) for the blast method, they can use the following command:
pick_otus.py -i seqs.fna -o blast_picked_otus_90_percent/ -m blast -r refseqs.fasta -s 0.90 -e 1e-30
Prefix-suffix OTU Picking Example:
OTUs can be picked by collapsing sequences which begin and/or end with identical bases (i.e., identical prefixes or suffixes). This OTU picker is currently likely to be of limited use on its own, but will be very useful in collapsing very similar sequences in a chained OTU picking strategy that is currently in development. For example, the user will be able to pick OTUs with this method, followed by representative set picking, and then re-pick OTUs on their representative set. This will allow for highly similar sequences to be collapsed, followed by running a slower OTU picker. This ability to chain OTU pickers is not yet supported in QIIME. The following command illustrates how to pick OTUs by collapsing sequences which are identical in their first 50 and last 25 bases:
pick_otus.py -i seqs.fna -o prefix_suffix_picked_otus/ -m prefix_suffix -p 50 -u 25
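As a rough illustration of prefix-suffix collapsing, the Python sketch below groups sequences by the pair (first p bases, last u bases). It is not QIIME's implementation; the function name is hypothetical, and numbering the resulting clusters from 0 in order of first appearance is an assumption.

```python
def pick_prefix_suffix_otus(seqs, prefix_length=50, suffix_length=25):
    """Cluster sequence ids that share both the same prefix and the same suffix."""
    groups = {}
    for seq_id, seq in seqs.items():
        prefix = seq[:prefix_length]
        # seq[-0:] would return the whole sequence, so handle 0 explicitly.
        suffix = seq[-suffix_length:] if suffix_length > 0 else ""
        groups.setdefault((prefix, suffix), []).append(seq_id)
    return {str(i): members for i, members in enumerate(groups.values())}


# Tiny example using a 3-base prefix and a 2-base suffix.
seqs = {
    "s1": "AAACCCGG",
    "s2": "AAATTTGG",  # same prefix and suffix as s1, different middle
    "s3": "AAACCCTT",  # same prefix as s1, different suffix
}
print(pick_prefix_suffix_otus(seqs, prefix_length=3, suffix_length=2))
```

Here s1 and s2 collapse into one cluster even though their middles differ, which is why this picker is best thought of as a fast pre-collapse step rather than a similarity-based method.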
Mothur OTU Picking Example:
The Mothur program (http://www.mothur.org/) provides three clustering algorithms for OTU formation: furthest-neighbor (complete linkage), average-neighbor (group average), and nearest-neighbor (single linkage). Details on the algorithms may be found on the Mothur website and publications (Schloss et al., 2009). However, the running times of Mothur’s clustering algorithms scale with the number of sequences squared, so the program may not be feasible for large data sets.
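To get a feel for what quadratic scaling means in practice, the short Python sketch below counts the pairwise comparisons among n sequences, n(n-1)/2. It assumes the cost is dominated by all-against-all distances, which is a simplification for illustration and not a statement about Mothur's internals.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered sequence pairs among n sequences."""
    return n * (n - 1) // 2


for n in (1_000, 10_000, 100_000):
    print(f"{n:,} sequences -> {pairwise_comparisons(n):,} pairs")
```

A tenfold increase in the number of sequences gives roughly a hundredfold increase in pairs; 100,000 sequences already imply about five billion pairs.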
The following command may be used to create OTUs based on a furthest-neighbor algorithm (the default setting) using aligned sequences as input:
pick_otus.py -i seqs.aligned.fna -o mothur_picked_otus/ -m mothur
If you prefer to use a nearest-neighbor algorithm instead, you may specify this with the ‘-c’ flag:
pick_otus.py -i seqs.aligned.fna -o mothur_picked_otus_nn/ -m mothur -c nearest
The sequence similarity parameter may also be specified. For example, the following command may be used to create OTUs at the level of 90% similarity:
pick_otus.py -i seqs.aligned.fna -o mothur_picked_otus_90_percent/ -m mothur -s 0.90
Usearch_qf (‘usearch quality filter’):
Usearch (http://www.drive5.com/usearch/) provides clustering, chimera checking, and quality filtering. The following command specifies a minimum cluster size of 2 to be used during cluster size filtering:
pick_otus.py -i seqs.fna -m usearch --word_length 64 --db_filepath refseqs.fasta -o usearch_qf_results/ --minsize 2
Usearch (usearch_qf) example where reference-based chimera detection is disabled and the minimum cluster size filter is reduced from the default (4) to 2:
pick_otus.py -i seqs.fna -m usearch --word_length 64 --suppress_reference_chimera_detection --minsize 2 -o usearch_qf_results_no_ref_chim_detection/
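The effect of a minimum cluster size can be illustrated on an OTU map such as the one shown in the Output section. The Python sketch below is only an after-the-fact illustration with a hypothetical function name; in the usearch quality-filter pipeline the size filter is applied by usearch itself during clustering, not on the final OTU map.

```python
def filter_by_min_size(otus, minsize=4):
    """Keep only clusters containing at least minsize sequences."""
    return {otu_id: members for otu_id, members in otus.items()
            if len(members) >= minsize}


otus = {
    "0": ["seq1", "seq5"],
    "1": ["seq2"],
    "2": ["seq3"],
    "3": ["seq4", "seq6", "seq7"],
}

print(filter_by_min_size(otus, minsize=2))  # clusters 0 and 3 remain
print(filter_by_min_size(otus))             # default of 4 removes every cluster
```

With a small dataset the default of 4 can discard everything, which is one reason to lower the value as in the commands above.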