Description:
This script is broken down into 4 possible OTU picking steps, followed by 2 steps that build the final OTU map, representative set, OTU tables and trees. The commands for each step are described below, including the input and resulting output files. The optional parameters that affect each step are also noted.
Step 1) Prefiltering and picking closed reference OTUs
The first step is an optional prefiltering of the input fasta file to remove sequences that do not hit the reference database with a given sequence identity (PREFILTER_PERCENT_ID). This step can take a very long time, so it is disabled by default. The prefilter parameters can be changed with the options: --prefilter_refseqs_fp --prefilter_percent_id
This filtering is accomplished by picking closed reference OTUs at the specified prefilter percent id to produce: prefilter_otus/seqs_otus.log prefilter_otus/seqs_otus.txt prefilter_otus/seqs_failures.txt prefilter_otus/seqs_clusters.uc
Next, the seqs_failures.txt file is used to remove these failed sequences from the original input fasta file to produce: prefilter_otus/prefiltered_seqs.fna
This prefiltered_seqs.fna file is then considered to contain the reads of the marker gene of interest, rather than spurious reads such as host genomic sequence or sequencing artifacts.
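The removal of failed sequences from a fasta file can be sketched in a few lines of Python. This is only an illustration, not the code the workflow runs: the function names read_fasta and drop_failures are hypothetical, and the sketch assumes that the failures file lists one sequence identifier per line and that a record's identifier is the first whitespace-delimited token of its header.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_failures(fasta_lines, failure_ids):
    """Return the records whose identifier is NOT listed in failure_ids.

    Assumption: the identifier is the first token of the header line.
    """
    failed = set(failure_ids)
    kept = []
    for header, seq in read_fasta(fasta_lines):
        tokens = header.split()
        seq_id = tokens[0] if tokens else ""
        if seq_id not in failed:
            kept.append((header, seq))
    return kept


# Toy input: s2 is listed as a failure, so only s1 survives.
fasta = [">s1 sample_A", "ACGT", ">s2 sample_A", "TTGA", "CC"]
kept = drop_failures(fasta, ["s2"])
```

The same pattern, inverted to keep only the listed identifiers, would describe how failures.fasta is derived later in Step 1.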
If prefiltering is applied, this step progresses with the prefiltered_seqs.fna. Otherwise it progresses with the input file. The Step 1 closed reference OTU picking is done against the supplied reference database. This command produces: step1_otus/_clusters.uc step1_otus/_failures.txt step1_otus/_otus.log step1_otus/_otus.txt
A representative sequence for each of the Step 1 picked OTUs is selected to produce: step1_otus/step1_rep_set.fna
Next, the sequences that failed to hit the reference database in Step 1 are filtered from the Step 1 input fasta file to produce: step1_otus/failures.fasta
Then the failures.fasta file is randomly subsampled to PERCENT_SUBSAMPLE of the sequences to produce: step1_otus/subsampled_failures.fna. Modifying PERCENT_SUBSAMPLE can have a big effect on run time for this workflow, but will not alter the final OTUs.
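As a rough illustration of the subsampling idea, the sketch below keeps each sequence identifier independently with probability equal to the requested fraction. The function name subsample_ids and the seed argument are hypothetical, and whether the workflow samples per sequence in this way or draws an exact count is an assumption of this sketch.

```python
import random


def subsample_ids(seq_ids, fraction, seed=None):
    """Keep each identifier independently with probability `fraction`.

    Assumption: per-sequence random inclusion, so the number of kept
    sequences is only approximately fraction * len(seq_ids).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    return [sid for sid in seq_ids if rng.random() < fraction]


# Toy input: 1000 failed reads, subsampled at 10%.
failure_ids = ["read_%d" % i for i in range(1000)]
subsampled = subsample_ids(failure_ids, 0.1, seed=42)
```

A smaller fraction means fewer sequences go into the de novo clustering of Step 2, which is where the run-time saving comes from.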
Step 2) The subsampled_failures.fna are next clustered de novo, and each cluster centroid is then chosen as a “new reference sequence” for use as the reference database in Step 3, to produce: step2_otus/subsampled_seqs_clusters.uc step2_otus/subsampled_seqs_otus.log step2_otus/subsampled_seqs_otus.txt step2_otus/step2_rep_set.fna
Step 3) Pick Closed Reference OTUs against Step 2 de novo OTUs Closed reference OTU picking is performed using the failures.fasta file created in Step 1 against the ‘reference’ de novo database created in Step 2 to produce: step3_otus/failures_seqs_clusters.uc step3_otus/failures_seqs_failures.txt step3_otus/failures_seqs_otus.log step3_otus/failures_seqs_otus.txt
Assuming the user has NOT passed the --suppress_step4 flag: The sequences which failed to hit the reference database in Step 3 are removed from the Step 3 input fasta file to produce: step3_otus/failures_failures.fasta
Step 4) Additional de novo OTU picking It is assumed by this point that the majority of sequences have been assigned to an OTU, and thus the sequence count of failures_failures.fasta is small enough that de novo OTU picking is computationally feasible. However, depending on the sequences being used, it might be that the failures_failures.fasta file is still prohibitively large for de novo clustering, and the jobs might take too long to finish. In this case it is likely that the user would want to pass the --suppress_step4 flag to avoid this additional de novo step.
A final round of de novo OTU picking is done on the failures_failures.fasta file to produce: step4_otus/failures_failures_cluster.uc step4_otus/failures_failures_otus.log step4_otus/failures_failures_otus.txt
A representative sequence for each cluster is chosen to produce: step4_otus/step4_rep_set.fna
Step 5) Produce the final OTU map and rep set If Step 4 is completed, the OTU maps from Step 1, Step 3, and Step 4 are concatenated to produce: final_otu_map.txt
If Step 4 was not completed, the OTU maps from Step 1 and Step 3 are concatenated to produce: final_otu_map.txt
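The concatenation can be pictured with the following sketch. It assumes the usual tab-separated OTU map layout (an OTU identifier followed by the identifiers of its member sequences), uses made-up OTU and sequence identifiers, and treats a repeated OTU identifier as an error, which is a choice of this sketch rather than documented behavior. The function names are hypothetical.

```python
def parse_otu_map(lines):
    """Parse tab-separated OTU map lines into {otu_id: [seq_id, ...]}."""
    otus = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        otus.setdefault(fields[0], []).extend(fields[1:])
    return otus


def concatenate_otu_maps(*maps):
    """Merge several OTU maps, given as iterables of lines, into one dict.

    Assumption: OTU identifiers are unique across the steps being merged.
    """
    merged = {}
    for lines in maps:
        for otu_id, seq_ids in parse_otu_map(lines).items():
            if otu_id in merged:
                raise ValueError("duplicate OTU id: %s" % otu_id)
            merged[otu_id] = seq_ids
    return merged


# Toy maps standing in for the Step 1, Step 3 and Step 4 outputs.
step1_map = ["ref1\ts1\ts2\n"]
step3_map = ["new1\ts3\n"]
step4_map = ["cleanup1\ts4\ts5\n"]
final_map = concatenate_otu_maps(step1_map, step3_map, step4_map)
```

When Step 4 is suppressed, the call would simply omit the third argument.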
Next, OTUs smaller than the minimum size set with the --min_otu_size flag are removed. For example, if the user left --min_otu_size at its default value of 2, requiring each OTU to contain at least 2 sequences, then any OTUs that failed to meet this criterion would be removed from final_otu_map.txt to produce: final_otu_map_mc2.txt
If --min_otu_size 10 was passed, it would produce: final_otu_map_mc10.txt
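A minimal sketch of the size filter and the resulting file name is shown below. It operates on an OTU map held as a dictionary from OTU identifier to member sequence identifiers; the function names and the toy identifiers are hypothetical, and only the final_otu_map_mcN.txt naming pattern is taken from the description above.

```python
def filter_otu_map(otu_map, min_otu_size=2):
    """Keep only OTUs that contain at least min_otu_size sequences."""
    return {otu_id: seq_ids
            for otu_id, seq_ids in otu_map.items()
            if len(seq_ids) >= min_otu_size}


def filtered_map_name(min_otu_size=2):
    """Build the output file name for a given minimum OTU size."""
    return "final_otu_map_mc%d.txt" % min_otu_size


# Toy map: one singleton OTU and one OTU with three sequences.
otu_map = {"ref1": ["s1"], "new1": ["s2", "s3", "s4"]}
kept_default = filter_otu_map(otu_map)           # singleton removed
kept_strict = filter_otu_map(otu_map, 10)        # nothing is large enough
```

With the default of 2, only singleton OTUs are discarded; larger thresholds trade sensitivity to rare OTUs for a smaller table.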
The final_otu_map_mc2.txt is used to build the final representative set: rep_set.fna
Step 6) Making the OTU tables and trees An OTU table is built using the final_otu_map_mc2.txt file to produce: otu_table_mc2.biom
As long as the --suppress_taxonomy_assignment flag is NOT passed, taxonomy will be assigned to each of the representative sequences in the final rep_set produced in Step 5, producing: rep_set_tax_assignments.log rep_set_tax_assignments.txt This taxonomic metadata is then added to the otu_table_mc2.biom to produce: otu_table_mc_w_tax.biom
As long as the --suppress_align_and_tree flag is NOT passed, the rep_set.fna file will be used to align the sequences and build the phylogenetic tree, which includes the de novo OTUs. Any sequences that fail to align are omitted from the OTU table and tree to produce: otu_table_mc_no_pynast_failures.biom rep_set.tre
If neither --suppress_taxonomy_assignment nor --suppress_align_and_tree is passed, the script will produce: otu_table_mc_w_tax_no_pynast_failures.biom
Keep in mind that with a large workflow script like this, the user can jump into intermediate steps. For example, imagine that the script was interrupted during Step 2, and the user did not want to repeat the OTU picking already done in Step 1. They can simply rerun the script and pass the --step1_otu_map_fp and --step1_failures_fasta_fp parameters, and the script will continue with Steps 2 - 4.
Note: If most or all of your sequences are failing to hit the reference during the prefiltering or closed-reference OTU picking steps, your sequences may be in the reverse orientation with respect to your reference database. To address this, you should add the following line to your parameters file (creating one, if necessary) and pass this file as -p:
pick_otus:enable_rev_strand_match True
Be aware that this doubles the amount of memory used in these steps of the workflow.
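If the parameters file is generated programmatically, something like the following sketch would write the line shown above. The helper name write_params_file is hypothetical, the file name ucrss_params.txt is borrowed from the usage examples, and the temporary directory is used only so the sketch can run anywhere.

```python
import os
import tempfile


def write_params_file(path, settings):
    """Write one 'script:option value' line per setting.

    `settings` maps (script_name, option_name) tuples to values.
    """
    with open(path, "w") as handle:
        for (script, option), value in sorted(settings.items()):
            handle.write("%s:%s %s\n" % (script, option, value))


# Write the reverse-strand setting to a parameters file.
params_fp = os.path.join(tempfile.mkdtemp(), "ucrss_params.txt")
write_params_file(params_fp, {("pick_otus", "enable_rev_strand_match"): True})
```

The resulting file is then passed to the script with -p, using an absolute path.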
Usage: pick_open_reference_otus.py [options]
Input Arguments:
Note
[REQUIRED]
[OPTIONAL]
Output:
Run the subsampled open-reference OTU picking workflow on seqs1.fna using refseqs.fna as the reference collection and using sortmerna and sumaclust as the OTU picking methods. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/).
pick_open_reference_otus.py -i $PWD/seqs1.fna -r $PWD/refseqs.fna -o $PWD/ucrss_sortmerna_sumaclust/ -p $PWD/ucrss_smr_suma_params.txt -m sortmerna_sumaclust
Run the subsampled open-reference OTU picking workflow on seqs1.fna using refseqs.fna as the reference collection. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/).
pick_open_reference_otus.py -i $PWD/seqs1.fna -r $PWD/refseqs.fna -o $PWD/ucrss/ -s 0.1 -p $PWD/ucrss_params.txt
Run the subsampled open-reference OTU picking workflow on seqs1.fna using refseqs.fna as the reference collection and using usearch61 and usearch61_ref as the OTU picking methods. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/).
pick_open_reference_otus.py -i $PWD/seqs1.fna -r $PWD/refseqs.fna -o $PWD/ucrss_usearch/ -s 0.1 -p $PWD/ucrss_params.txt -m usearch61
Run the subsampled open-reference OTU picking workflow in iterative mode on seqs1.fna and seqs2.fna using refseqs.fna as the initial reference collection. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/).
pick_open_reference_otus.py -i $PWD/seqs1.fna,$PWD/seqs2.fna -r $PWD/refseqs.fna -o $PWD/ucrss_iter/ -s 0.1 -p $PWD/ucrss_params.txt
Run the subsampled open-reference OTU picking workflow in iterative mode on seqs1.fna and seqs2.fna using refseqs.fna as the initial reference collection, suppressing alignment and tree building. This is useful if you're working with marker genes that do not result in useful alignment (e.g., fungal ITS). ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/).
pick_open_reference_otus.py -i $PWD/seqs1.fna,$PWD/seqs2.fna -r $PWD/refseqs.fna -o $PWD/ucrss_iter_no_tree/ -s 0.1 -p $PWD/ucrss_params.txt --suppress_align_and_tree
Run the subsampled open-reference OTU picking workflow in iterative mode on seqs1.fna and seqs2.fna using refseqs.fna as the initial reference collection, suppressing assignment of taxonomy. This is useful if you're working with a reference collection without associated taxonomy. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/).
pick_open_reference_otus.py -i $PWD/seqs1.fna,$PWD/seqs2.fna -r $PWD/refseqs.fna -o $PWD/ucrss_iter_no_tax/ -s 0.1 -p $PWD/ucrss_params.txt --suppress_taxonomy_assignment