pick_otus.py – OTU picking
The OTU picking step assigns similar sequences to operational taxonomic units, or OTUs, by clustering sequences based on a user-defined similarity threshold. Sequences which are similar at or above the threshold level are taken to represent the presence of a taxonomic unit (e.g., a genus, when the similarity threshold is set at 0.94) in the sequence collection.
Currently, the following clustering methods have been implemented in QIIME:
- cd-hit (Li & Godzik, 2006; Li, Jaroszewski, & Godzik, 2001), which applies a “longest-sequence-first list removal algorithm” to cluster sequences.
- blast (Altschul, Gish, Miller, Myers, & Lipman, 1990), which compares and clusters each sequence against a reference database of sequences.
- Mothur (Schloss et al., 2009), which requires an input file of aligned sequences. The input file of aligned sequences may be generated from an input file like the one described below by running align_seqs.py. For the Mothur method, the clustering algorithm may be specified as nearest-neighbor, furthest-neighbor, or average-neighbor. The default algorithm is furthest-neighbor.
- prefix/suffix [Qiime team, unpublished], which will collapse sequences which are identical in their first and/or last bases (i.e., their prefix and/or suffix). The prefix and suffix lengths are provided by the user and default to 50 each.
- Trie [Qiime team, unpublished], which collapsing identical sequences and sequences which are subsequences of other sequences.
- uclust (Edgar, RC 2010), creates “seeds” of sequences which generate clusters based on percent identity.
- uclust_ref (Edgar, RC 2010), as uclust, but takes a reference database to use as seeds. New clusters can be toggled on or off.
- usearch (Edgar, RC 2010, version v5.2.236), creates “seeds” of sequences which generate clusters based on percent identity, filters low abundance clusters, performs de novo and reference based chimera detection.
- usearch_ref (Edgar, RC 2010, version v5.2.236), as usearch, but takes a reference database to use as seeds. New clusters can be toggled on or off.
Quality filtering pipeline with usearch 5.X is described as usearch_qf “usearch quality filter”, described here: http://qiime.org/tutorials/usearch_quality_filter.html
- usearch61 (Edgar, RC 2010, version v6.1.544), creates “seeds” of sequences which generate clusters based on percent identity.
- usearch61_ref (Edgar, RC 2010, version v6.1.544), as usearch61, but takes a reference database to use as seeds. New clusters can be toggled on or off.
- sumaclust (Mercier, C. et al., 2014, version 1.0), creates “seeds” of sequences which generate clusters based on similarity threshold.
- sortmerna_v2 (Kopylova, E. et al., 2012), takes a reference database to use as seeds.
- swarm (Mahe, F. et al., 2014), creates “seeds” of sequences which generate clusters based on a resolution threshold.
Chimera checking with usearch 6.X is implemented in identify_chimeric_seqs.py. Chimera checking should be done first with usearch 6.X, and the filtered resulting fasta file can then be clustered.
The primary inputs for pick_otus.py are:
- A FASTA file containing sequences to be clustered
- An OTU threshold (default is 0.97, roughly corresponding to species-level OTUs);
- The method to be applied for clustering sequences into OTUs.
pick_otus.py takes a standard fasta file as input.
Usage: pick_otus.py [options]
- -i, --input_seqs_filepath
- Path to input sequences file
- -m, --otu_picking_method
- Method for picking OTUs. Valid choices are: sortmerna, mothur, trie, uclust_ref, usearch, usearch_ref, blast, usearch61, usearch61_ref, sumaclust, swarm, prefix_suffix, cdhit, uclust. The mothur method requires an input file of aligned sequences. usearch will enable the usearch quality filtering pipeline. [default: uclust]
- -c, --clustering_algorithm
- Clustering algorithm for mothur otu picking method. Valid choices are: furthest, nearest, average. [default: furthest]
- -M, --max_cdhit_memory
- Maximum available memory to cd-hit-est (via the program’s -M option) for cdhit OTU picking method (units of Mbyte) [default: 400]
- -o, --output_dir
- Path to store result file [default: ./<OTU_METHOD>_picked_otus/]
- -r, --refseqs_fp
- Path to reference sequences to search against when using -m blast, -m sortmerna, -m uclust_ref, -m usearch_ref, or -m usearch61_ref [default: /Users/caporaso/.virtualenvs/qiime/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta]
- -b, --blast_db
- Pre-existing database to blast against when using -m blast [default: None]
- -e, --max_e_value_blast
- Max E-value when clustering with BLAST [default: 1e-10]
- Pre-existing database to search against when using -m sortmerna [default: None]
- Maximum E-value when clustering [default = 1]
- Mininum percent query coverage (of an alignment) to consider a hit, expressed as a fraction between 0 and 1 [default: 0.97]
- Output alignments in the Blast tabular format with two additional columns including the CIGAR string and the percent query coverage [default: False]
- Must be set together with –sortmerna_tabular. This option specifies how many alignments per read will be written [default: 1]
- The maximum number of positions per seed to store in the indexed database [default: 10000]
- Minimum percent of query sequence that can be aligned to consider a hit, expressed as a fraction between 0 and 1 (BLAST OTU picker only) [default: 0.5]
- -s, --similarity
- Sequence similarity threshold (for blast, cdhit, uclust, uclust_ref, usearch, usearch_ref, usearch61, usearch61_ref, sumaclust, and sortmerna) [default: 0.97]
- A sequence is assigned to the best matching seed rather than the first matching seed passing the similarity threshold [default: False]
- Reference sequence length if the shortest [default: True]
- OTU identifier prefix (string) for the de novo OTU pickers (sumaclust, swarm and uclust) [default: denovo, OTU ids are ascendingintegers]
- Maximum number of differences allowed between two amplicons, meaning that two amplicons will be grouped if they have integer (or less) differences (see Swarm manual at https://github.com/torognes/swarm for more details). [default: 1]
- -q, --trie_reverse_seqs
- Reverse seqs before picking OTUs with the Trie OTU picker for suffix (rather than prefix) collapsing [default: False]
- -n, --prefix_prefilter_length
- Prefilter data so seqs with identical first prefix_prefilter_length are automatically grouped into a single OTU. This is useful for large sequence collections where OTU picking doesn’t scale well [default: None; 100 is a good value]
- -t, --trie_prefilter
- Prefilter data so seqs which are identical prefixes of a longer seq are automatically grouped into a single OTU; useful for large sequence collections where OTU picking doesn’t scale well [default: False]
- -p, --prefix_length
- Prefix length when using the prefix_suffix otu picker; WARNING: CURRENTLY DIFFERENT FROM prefix_prefilter_length (-n)! [default: 50]
- -u, --suffix_length
- Suffix length when using the prefix_suffix otu picker [default: 50]
- -z, --enable_rev_strand_match
- Enable reverse strand matching for uclust, uclust_ref, usearch, usearch_ref, usearch61, or usearch61_ref otu picking, will double the amount of memory used. [default: False]
- -D, --suppress_presort_by_abundance_uclust
- Suppress presorting of sequences by abundance when picking OTUs with uclust or uclust_ref [default: False]
- -A, --optimal_uclust
- Pass the –optimal flag to uclust for uclust otu picking. [default: False]
- -E, --exact_uclust
- Pass the –exact flag to uclust for uclust otu picking. [default: False]
- -B, --user_sort
- Pass the –user_sort flag to uclust for uclust otu picking. [default: False]
- -C, --suppress_new_clusters
- Suppress creation of new clusters using seqs that don’t match reference when using -m uclust_ref, -m usearch61_ref, or -m usearch_ref [default: False]
- Max_accepts value to uclust, uclust_ref, usearch61, and usearch61_ref. By default, will use value suggested by method (uclust: 1, usearch61: 1) [default: default]
- Max_rejects value for uclust, uclust_ref, usearch61, and usearch61_ref. With default settings, will use value recommended by clustering method used (uclust: 8, usearch61: 8 for usearch_fast_cluster option, 32 for reference and smallmem options) [default: default]
- Stepwords value to uclust and uclust_ref [default: 8]
- Word length value for uclust, uclust_ref, and usearch, usearch_ref, usearch61, and usearch61_ref. With default setting, will use the setting recommended by the method (uclust: 8, usearch: 64, usearch61: 8). int value can be supplied to override this setting. [default: default]
- Don’t pass –stable-sort to uclust [default: False]
- Don’t collapse exact matches before calling sortmerna, sumaclust or uclust [default: False]
- -d, --save_uc_files
- Enable preservation of intermediate uclust (.uc) files that are used to generate clusters via uclust. Also enables preservation of all intermediate files created by usearch and usearch61. [default: True]
- -j, --percent_id_err
- Percent identity threshold for cluster error detection with usearch, expressed as a fraction between 0 and 1. [default: 0.97]
- -g, --minsize
- Minimum cluster size for size filtering with usearch. [default: 4]
- -a, --abundance_skew
- Abundance skew setting for de novo chimera detection with usearch. [default: 2.0]
- -f, --db_filepath
- Reference database of fasta sequences for reference based chimera detection with usearch. [default: None]
- Percent ID for mapping OTUs created by usearch back to original sequence IDs [default: 0.97]
- Deprecated: de novo chimera detection performed by default, pass –suppress_de_novo_chimera_detection to disable. [default: None]
- -k, --suppress_de_novo_chimera_detection
- Suppress de novo chimera detection in usearch. [default: False]
- Deprecated: Reference based chimera detection performed by default, pass –supress_reference_chimera_detection to disable [default: None]
- -x, --suppress_reference_chimera_detection
- Suppress reference based chimera detection in usearch. [default: False]
- Deprecated, cluster size filtering enabled by default, pass –suppress_cluster_size_filtering to disable. [default: None]
- -l, --suppress_cluster_size_filtering
- Suppress cluster size filtering in usearch. [default: False]
- Disable creation of logs when usearch is called. Up to nine logs are created, depending on filtering steps enabled. [default: False]
- Dereplication of full sequences, instead of subsequences. Faster than the default –derep_subseqs in usearch. [default: False]
- -F, --non_chimeras_retention
- Selects subsets of sequences detected as non-chimeras to retain after de novo and reference based chimera detection. Options are intersection or union. union will retain sequences that are flagged as non-chimeric from either filter, while intersection will retain only those sequences that are flagged as non-chimeras from both detection methods. [default: union]
- Minimum length of sequence allowed for usearch, usearch_ref, usearch61, and usearch61_ref. [default: 64]
- Use fast clustering option for usearch or usearch61_ref with new clusters. –enable_rev_strand_match can not be enabled with this option, and the only valid option for usearch61_sort_method is ‘length’. This option uses more memory than the default option for de novo clustering. [default: False]
- Sorting method for usearch61 and usearch61_ref. Valid options are abundance, length, or None. If the –usearch_fast_cluster option is enabled, the only sorting method allowed in length. [default: abundance]
- Enable size based preference in clustering with usearch61. Requires that –usearch61_sort_method be abundance. [default: False]
- Specify number of threads (1 thread per core) to be used for usearch61, sortmerna, sumaclust and swarm commands that utilize multithreading. [default: 1]
The output consists of two files (i.e. seqs_otus.txt and seqs_otus.log). The .txt file is composed of tab-delimited lines, where the first field on each line corresponds to an (arbitrary) cluster identifier, and the remaining fields correspond to sequence identifiers assigned to that cluster. Sequence identifiers correspond to those provided in the input FASTA file. Usearch (i.e. usearch quality filter) can additionally have log files for each intermediate call to usearch.
Example lines from the resulting .txt file:
This result implies that four clusters were created based on 7 input sequences. The first cluster (cluster id 0) contains two sequences, sequence ids seq1 and seq5; the second cluster (cluster id 1) contains one sequence, sequence id seq2; the third cluster (cluster id 2) contains one sequence, sequence id seq3, and the final cluster (cluster id 3) contains three sequences, sequence ids seq4, seq6, and seq7.
The resulting .log file contains a list of parameters passed to the pick_otus.py script along with the output location of the resulting .txt file.
Example (uclust method, default):
Using the seqs.fna file generated from split_libraries.py and outputting the results to the directory “picked_otus_default/”, while using default parameters (0.97 sequence similarity, no reverse strand matching):
pick_otus.py -i seqs.fna -o picked_otus_default
To change the percent identity to a lower value, such as 90%, and also enable reverse strand matching, the command would be the following:
pick_otus.py -i seqs.fna -o picked_otus_90_percent_rev/ -s 0.90 -z
Uclust Reference-based OTU picking example:
uclust_ref can be passed via -m to pick OTUs against a reference set where sequences within the similarity threshold to a reference sequence will cluster to an OTU defined by that reference sequence, and sequences outside of the similarity threshold to a reference sequence will form new clusters. OTU identifiers will be set to reference sequence identifiers when sequences cluster to reference sequences, and ‘qiime_otu_<integer>’ for new OTUs. Creation of new clusters can be suppressed by passing -C, in which case sequences outside of the similarity threshold to any reference sequence will be listed as failures in the log file, and not included in any OTU.
pick_otus.py -i seqs.fna -r refseqs.fasta -m uclust_ref --denovo_otu_id_prefix qiime_otu_
Example (cdhit method):
Using the seqs.fna file generated from split_libraries.py and outputting the results to the directory “cdhit_picked_otus/”, while using default parameters (0.97 sequence similarity, no prefix filtering):
pick_otus.py -i seqs.fna -m cdhit -o cdhit_picked_otus/
Currently the cd-hit OTU picker allows for users to perform a pre-filtering step, so that highly similar sequences are clustered prior to OTU picking. This works by collapsing sequences which begin with an identical n-base prefix, where n is specified by the -n parameter. A commonly used value here is 100 (e.g., -n 100). So, if using this filter with -n 100, all sequences which are identical in their first 100 bases will be clustered together, and only one representative sequence from each cluster will be passed to cd-hit. This is used to greatly decrease the run-time of cd-hit-based OTU picking when working with very large sequence collections, as shown by the following command:
pick_otus.py -i seqs.fna -m cdhit -o cdhit_picked_otus_filter/ -n 100
Alternatively, if the user would like to collapse identical sequences, or those which are subsequences of other sequences prior to OTU picking, they can use the trie prefiltering (“-t”) option as shown by the following command.
Note: It is highly recommended to use one of the prefiltering methods when analyzing large datasets (>100,000 seqs) to reduce run-time.
pick_otus.py -i seqs.fna -m cdhit -o cdhit_picked_otus_trie_prefilter/ -t
BLAST OTU-Picking Example:
OTUs can be picked against a reference database using the BLAST OTU picker. This is useful, for example, when different regions of the SSU RNA have sequenced and a sequence similarity based approach like cd-hit therefore wouldn’t work. When using the BLAST OTU picking method, the user must supply either a reference set of sequences or a reference database to compare against. The OTU identifiers resulting from this step will be the sequence identifiers in the reference database. This allows for use of a pre-existing tree in downstream analyses, which again is useful in cases where different regions of the 16s gene have been sequenced.
The following command can be used to blast against a reference sequence set, using the default E-value and sequence similarity (0.97) parameters:
pick_otus.py -i seqs.fna -o blast_picked_otus/ -m blast -r refseqs.fasta
If you already have a pre-built BLAST database, you can pass the database prefix as shown by the following command:
pick_otus.py -i seqs.fna -o blast_picked_otus_prebuilt_db/ -m blast -b refseqs.fasta
If the user would like to change the sequence similarity (“-s”) and/or the E-value (“-e”) for the blast method, they can use the following command:
pick_otus.py -i seqs.fna -o blast_picked_otus_90_percent/ -m blast -r refseqs.fasta -s 0.90 -e 1e-30
Prefix-suffix OTU Picking Example:
OTUs can be picked by collapsing sequences which begin and/or end with identical bases (i.e., identical prefixes or suffixes). This OTU picker is currently likely to be of limited use on its own, but will be very useful in collapsing very similar sequences in a chained OTU picking strategy that is currently in development. For example, the user will be able to pick OTUs with this method, followed by representative set picking, and then re-pick OTUs on their representative set. This will allow for highly similar sequences to be collapsed, followed by running a slower OTU picker. This ability to chain OTU pickers is not yet supported in QIIME. The following command illustrates how to pick OTUs by collapsing sequences which are identical in their first 50 and last 25 bases:
pick_otus.py -i seqs.fna -o prefix_suffix_picked_otus/ -m prefix_suffix -p 50 -u 25
Mothur OTU Picking Example:
The Mothur program (http://www.mothur.org/) provides three clustering algorithms for OTU formation: furthest-neighbor (complete linkage), average-neighbor (group average), and nearest-neighbor (single linkage). Details on the algorithms may be found on the Mothur website and publications (Schloss et al., 2009). However, the running times of Mothur’s clustering algorithms scale with the number of sequences squared, so the program may not be feasible for large data sets.
The following command may be used to create OTUs based on a furthest-neighbor algorithm (the default setting) using aligned sequences as input:
pick_otus.py -i seqs.aligned.fna -o mothur_picked_otus/ -m mothur
If you prefer to use a nearest-neighbor algorithm instead, you may specify this with the ‘-c’ flag:
pick_otus.py -i seqs.aligned.fna -o mothur_picked_otus_nn/ -m mothur -c nearest
The sequence similarity parameter may also be specified. For example, the following command may be used to create OTUs at the level of 90% similarity:
pick_otus.py -i seqs.aligned.fna -o mothur_picked_otus_90_percent/ -m mothur -s 0.90
Usearch (http://www.drive5.com/usearch/) provides clustering, chimera checking, and quality filtering. The following command specifies a minimum cluster size of 2 to be used during cluster size filtering:
pick_otus.py -i seqs.fna -m usearch --word_length 64 --db_filepath refseqs.fasta -o usearch_qf_results/ --minsize 2
usearch example where reference-based chimera detection is disabled, and minimum cluster size filter is reduced from default (4) to 2:
pick_otus.py -i seqs.fna -m usearch --word_length 64 --suppress_reference_chimera_detection --minsize 2 -o usearch_qf_results_no_ref_chim_detection/
Use de novo OTU-picker Swarm:
Using the seqs.fna file generated from split_libraries.py and outputting the results to the directory “$PWD/picked_otus_swarm/”, while using default parameters (resolution = 1)
pick_otus.py -i $PWD/seqs.fna -m swarm -o $PWD/picked_otus_swarm