News and Announcements » |
This document describes how to use QIIME to analyze very large data sets, generally on the HiSeq2000, using an open-reference OTU picking protocol. There are two primary options here: first is to use standard open-reference uclust (i.e., uclust_ref allowing for new clusters); and second is to use the subsampled open-reference OTU picking protocol. The results from both will be nearly identical. The difference is that the first option is slower, but available in QIIME 1.4.0 (and most earlier versions). The runtime will be limiting for multiple HiSeq2000 lanes, or for single HiSeq2000 lanes with high diversity. The second option is much faster, and will work for multiple HiSeq2000 lanes or runs. It is available in QIIME 1.5.0 and later.
This document very briefly covers option 1. Most of the text covers option 2, including a description of how to use it, what exactly is happening, and test results from applying this to some well-understood 454 data.
Note
You can always find a link to the latest version of the Greengenes reference OTUs and the AMI of the latest QIIME EC2 instances here.
Standard open-reference OTU picking is suitable for a single HiSeq2000 lane (unless it’s very high diversity, in which case runtime may be a limiting factor). You’ll use the pick_otus_through_otu_table.py workflow script in QIIME with a custom parameters file.
Your parameters file should look like the following:
pick_otus:otu_picking_method uclust_ref
pick_otus:refseqs_fp <PATH TO REFERENCE COLLECTION>
pick_otus:enable_rev_strand_match True
Where <PATH TO REFERENCE COLLECTION> is replaced with the path to the reference data set you’d like to pick OTUs against. The QIIME development group frequently used the Greengenes reference OTUs. On the QIIME EC2 instance, this path would be /software/gg_otus-4feb2011-release/rep_set/gg_97_otus_4feb2011.fasta.
This command should be run in parallel. Each job will need approximately 4GB of RAM, so if running on EC2 and you want to start 8 parallel jobs (recommended setting for EC2), your instance type should be m2.4xlarge.
You can then use the following commands. You should always use full paths which are represented here by $PWD, but will usually look something like /home/ubuntu/my_data/ (in other words, they should start with a /). In this example your input sequences (seqs.fna), your parameters file (ucr_params.txt), and your metadata mapping file (map.txt) are all in the same directory represented by $PWD. If you work from the directory containing those files, you can leave $PWD in the commands instead of specifying the full paths.
First, pick otus, choose representative sequences, assign taxonomy to OTUs, and build a phylogenetic tree. The -aO 8 specifies that we want to start 8 parallel jobs - adjust this according to the resources you have available. This is open-reference OTU picking, so reads will be clustered against the reference database (in parallel) and reads which fail to hit the reference data set will subsequently be clustered de novo (serially):
pick_otus_through_otu_table.py -i $PWD/seqs.fna -o $PWD/ucr/ -p $PWD/ucr_params.txt -aO 8
When working with Illumina data you typically want to filter singleton OTUs (i.e., OTUs with only one sequence) as these are likely to represent sequencing or PCR errors. In QIIME 1.4.0 (and most earlier versions) you can do that with this command:
filter_otu_table.py -i $PWD/ucr/otu_table.txt -o $PWD/ucr/otu_table_mc2.txt -c 2 -s 1
In QIIME 1.4.0-dev and later, you can filter singleton OTUs with this command:
filter_otus_from_otu_table.py -i $PWD/ucr/otu_table.biom -o $PWD/ucr/otu_table_mc2.biom -c 2
You’ll notice that depending on your version of QIIME, the extension on your OTU table file will differ. In QIIME 1.4.0 and earlier, it will be .txt. In QIIME 1.4.0-dev and later it will be .biom. We’ll continue this example assuming that your OTU table ends with .txt (if you’re working with QIIME 1.4.0-dev or later, you likely decided to go with option 2 for OTU picking).
As PCoA of UniFrac distances between samples is a frequent result of interest in microbial ecology, we’ll cover how to generate PCoA plots next. The first thing you’ll want to do is evenly sample your OTU table. To choose an even sampling depth, review the number of reads per sample:
per_library_stats.py -i $PWD/ucr97/otu_table_mc2.biom
This will print information on the number of reads per sample to the terminal. Choose a depth of sampling that maximizes the number of sequences you’ll include, and also the number of samples that have at least that many sequences: samples with fewer sequences will be excluded from your beta diversity/PCoA analysis. Even sampling is absolutely critical to getting meaningful UniFrac distances between your samples.
After choosing an even sampling depth you can use the beta_diversity_through_plots.py script to rarify your OTU table, compute UniFrac distances between samples, and run Principal Coordinates Analysis with the following command (in this example we have chosen an even sampling depth of 25,000 sequences/sample):
beta_diversity_through_plots.py -i $PWD/ucr/otu_table_mc2.biom -e 25000 -o $PWD/ucr/bdiv_even25000/ -t $PWD/ucr/rep_set.tre -m $PWD/map.txt -aO8
Again the -aO8 specifies that the job should be run in parallel on 8 processors. Adjust this according to your resources. When this completes you can open 3D interactive PCoA plots by opening the file bdiv_even25000/unweighted_unifrac_3d_continuous/unweighted_unifrac_pc_3D_PCoA_plots.html in a web browser. This may take a minute to load for very large sets of samples.
Subsampled reference OTU picking is suitable for any analysis that standard open-reference OTU picking can be used for, but additionally scales to much larger data sets (such as multiple HiSeq runs, which may require several days on ~100 processors to analyze).
Warning
If processing multiple HiSeq lanes, don’t combine the sequence data into a single file. Instead, see Using the subsampled open-reference OTU picking workflow in iterative mode.
This is an open-reference OTU picking protocol, meaning that sequences are clustered against a reference database, and reads which fail to hit the reference are subsequently clustered de novo. This differs from standard open-reference OTU picking as it was optimized at several steps to enable running on massive numbers of sequences (hundreds of millions, which is massive as of this writing). The steps in this workflow are as follows.
Prefilter the input sequence collection by searching reads against the reference set with a low percent identity threshold (default is 60%, modify with --prefilter_percent_id). The choice of 60% is described here. All reads which fail to hit the reference set are discarded as likely sequencing error.
Warning
If most or all of your sequences are being filtered at this step, your sequences may be in the reverse orientation with respect to your reference database. To address this, you should add the following to your parameters file (creating one, if necessary) and pass this file as -p to pick_subsampled_reference_otus_through_otu_table.py: pick_otus:enable_rev_strand_match True. This is included in the instructions below, but be aware that this doubles the memory used in this step of the workflow.
Apply closed-reference OTU picking against the reference collection. Generate a fasta file containing all reads that fail to hit the reference collection.
Randomly subsample the sequences that failed to hit the reference collection, and write these to a new fasta file (default subsampling percentage is 0.1%, modify with -s/--percent_subsample). Cluster these reads de novo, and choose a representative set of sequences as the centroid of each OTU cluster. These are the new reference OTUs.
Pick closed-reference OTUs against the representative sequences from the previous step. Write all sequences that fail to hit the reference collection to a fasta file.
Pick de novo OTUs on all reads that failed to hit the reference collection in the previous step. These are the clean-up OTUs. This step can be suppress by passing --suppress_step4.
To apply this analysis to seqs1.fna, picking OTUs against the reference collection refseqs.fna you can run the following command. Your parameters file should contain the following:
pick_otus:otu_picking_method uclust_ref
pick_otus:enable_rev_strand_match True
You should always use full paths which are represented here by $PWD, but will usually look something like /home/ubuntu/my_data/ (in other words, they should start with a /). In this example your input sequences (seqs1.fna), and your metadata mapping file (map.txt) are all in the same directory represented by $PWD. If you work from the directory containing those files, you can leave $PWD in the commands instead of specifying the full paths:
pick_subsampled_reference_otus_through_otu_table.py -i $PWD/seqs1.fna -r $PWD/refseqs.fna -o $PWD/ucrss/ -aO 8 -p $PWD/ucrss_params.txt
This command should be run in parallel. Each job will need approximately 4GB of RAM, so if running on EC2 and you want to start 8 parallel jobs (recommended setting for EC2), your instance type should be m2.4xlarge. The -aO 8 specifies that we want to start 8 parallel jobs - adjust this according to the resources you have available.
As PCoA of UniFrac distances between samples is a frequent result of interest in microbial ecology, we’ll cover how to generate PCoA plots next. The first thing you’ll want to do is evenly sample your OTU table. To choose an even sampling depth, review the number of reads per sample:
per_library_stats.py -i $PWD/ucrss/otu_table_mc2_w_tax_no_pynast_failures.biom
This will print information on the number of reads per sample to the terminal. Choose a depth of sampling that maximizes the number of sequences you’ll include, and also the number of samples that have at least that many sequences: samples with fewer sequences will be excluded from your beta diversity/PCoA analysis. Even sampling is absolutely critical to getting meaningful UniFrac distances between your samples.
After choosing an even sampling depth you can use the beta_diversity_through_plots.py script to rarify your OTU table, compute UniFrac distances between samples, and run Principal Coordinates Analysis with the following command (in this example we have chosen an even sampling depth of 25,000 sequences/sample):
beta_diversity_through_plots.py -i $PWD/ucrss/otu_table_mc2_w_tax_no_pynast_failures.biom -e 25000 -o $PWD/ucrss/bdiv_even25000/ -t $PWD/ucrss/rep_set.tre -m $PWD/map.txt -aO8
Again the -aO8 specifies that the job should be run in parallel on 8 processors. Adjust this according to your resources. When this completes you can open 3D interactive PCoA plots by opening the file bdiv_even25000/unweighted_unifrac_3d_continuous/unweighted_unifrac_pc_3D_PCoA_plots.html in a web browser. This may take a minute to load for very large sets of samples.
There are cases where you may be interested in working with the closed reference subset of your open reference OTU table (meaning only those OTUs that hit the reference collection, excluding the new OTUs). Following from the above commands, to do that you can filter the new OTUs from the OTU table with the following command:
filter_otus_from_otu_table.py -i $PWD/ucrss/otu_table_mc2_w_tax_no_pynast_failures.biom -o $PWD/ucrss/otu_table_mc2_w_tax_no_pynast_failures.reference_only.biom --negate_ids_to_exclude -e $PWD/refseqs.fna
What this does is filter exclude all OTUs with identifiers that are not present in $PWD/refseqs.fna, so all of the new OTUs.
Several analyses were performed to confirm that results are comparable between the subsampled open-reference OTU picking workflow and the standard open-reference OTU picking workflow. These include analyses on two different data sets: one host-associated (the Costello Whole Body study) and one free-living (the Lauber 88 soils study). These two were chosen as Greengenes (the reference set being used) is known to be biased toward human-associated microbes, so I wanted to confirm that the method works when few sequences fail to hit the reference set (whole body) and when many sequences fail to hit the reference set (88 soils).
For all analyses, sequences that fail to align with PyNAST and singleton OTUs were discarded (these are defaults in the subsampled OTU picking workflow).
This analysis is based on the data presented in the Lauber 88 soils study.
Here I checked whether the subsampled reference OTU alpha diversities for all samples were correlated with the de novo OTU picking, standard open-reference OTU picking, and closed-reference OTU picking alpha diversities. The observed species/OTUs metric was calculated on add data sets (alpha_diversity.py -m observed_species), and the Pearson correlations were computed for subsampled reference OTU picking with the three other sets of values.
- subsampled open-reference OTU picking versus de novo OTU picking: r=0.995 p=4.836e-88
- subsampled open-reference OTU picking versus standard open-reference OTU picking: r=1.000 p=0.000
- subsampled open-reference OTU picking versus closed-reference OTU picking: r=0.8634 p=1.405e-27
Subsampled open-reference OTU picking alpha diversity values are significantly correlated with de novo, standard open-reference, and closed-reference OTU picking results. This suggests that we will derive the same biological conclusions between regarding alpha diversity when using the subsampled OTU picking workflow.
Here I checked whether Procrustes comparisons of unweighted UniFrac PCoA plots between subsampled open-reference OTU picking and de novo OTU picking, standard open-reference OTU picking, and closed-reference OTU picking yield significant results. This was calculated using transform_coordinate_matrices.py which is described in the Procrustes tutorial. p-values are based on 1000 Monte Carlo iterations.
- subsampled open-reference OTU picking versus de novo OTU picking: M2=0.009 p<0.001
- subsampled open-reference OTU picking versus standard open-reference OTU picking: M2=0.007 p<0.001
- subsampled open-reference OTU picking versus closed-reference OTU picking: M2=0.039 p<0.001
Procrustes results are highly significant for the three comparisons, suggesting that we will derive the same biological conclusions regardless of which of these OTU picking workflows is used.
Here I confirm that the same taxonomy groups are identified as significantly different across the pH gradient in these soils using ANOVA, regardless of which OTU picking workflow is applied. These results were computed with the otu_category_significance.py script. To define a category for this test I binned the pH values by truncating the values to integers (so 5.0, 5.3, and 5.9 are all binned to pH 5) and using this binned pH as the category. Since I’m just looking for consistent results across the different OTU picking methods I don’t think it’s important that this isn’t the most biologically relevant binning strategy. Note that OTU ids are not directly comparable across all analyses, so it is best to compare the taxonomies.
Top 5 OTUs that differ across bins for subsampled open-reference OTU picking:
OTU ID | Bonferroni-adjusted p-value | Taxonomy |
---|---|---|
New.CleanUp.ReferenceOTU26927 | 1.99e-11 | k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Chromatiales; f__Sinobacteraceae |
New.CleanUp.ReferenceOTU34053 | 7.06e-09 | k__Bacteria; p__Acidobacteria; c__Solibacteres; o__Solibacterales; f__Solibacteraceae |
212596 | 9.26e-09 | k__Bacteria; p__Acidobacteria; c__Solibacteres; o__Solibacterales; f__Solibacteraceae |
112859 | 1.22e-08 | k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhizobiales; f__Hyphomicrobiaceae |
New.CleanUp.ReferenceOTU36189 | 4.35e-08 | k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Rubrobacterales; f__Rubrobacteraceae |
Top 5 OTUs that differ across bins for de novo OTU picking:
OTU ID | Bonferroni-adjusted p-value | Taxonomy |
---|---|---|
26819 | 3.19e-11 | k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Chromatiales; f__Sinobacteraceae |
28062 | 5.92e-10 | k__Bacteria; p__Acidobacteria; c__; o__; f__Koribacteraceae |
35264 | 2.43e-09 | k__Bacteria; p__Acidobacteria; c__Solibacteres; o__Solibacterales; f__Solibacteraceae |
45059 | 5.48e-09 | k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__; f__ |
7687 | 2.056e-08 | k__Bacteria; p__Acidobacteria; c__Solibacteres; o__Solibacterales; f__Solibacteraceae |
Top 5 OTUs that differ across bins for standard open-reference OTU picking:
OTU ID | Bonferroni-adjusted p-value | Taxonomy |
---|---|---|
DeNovoOTU26928 | 1.99e-11 | k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Chromatiales; f__Sinobacteraceae |
DeNovoOTU34054 | 7.06e-09 | k__Bacteria; p__Acidobacteria; c__Solibacteres; o__Solibacterales; f__Solibacteraceae |
212596 | 9.26e-09 | k__Bacteria; p__Acidobacteria; c__Solibacteres; o__Solibacterales; f__Solibacteraceae |
112859 | 1.22e-08 | k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhizobiales; f__Hyphomicrobiaceae |
DeNovoOTU36190 | 4.35e-08 | k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Rubrobacterales; f__Rubrobacteraceae |
Top 5 OTUs that differ across bins for closed-reference OTU picking:
OTU ID | Bonferroni-adjusted p-value | Taxonomy |
---|---|---|
212596 | 4.03e-09 | k__Bacteria; p__Acidobacteria; c__Solibacteres; o__Solibacterales; f__Solibacteraceae; g__CandidatusSolibacter; s__ |
112859 | 4.62e-09 | k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhizobiales; f__; g__; s__ |
544749 | 5.56e-08 | k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Chromatiales; f__Sinobacteraceae; g__; s__ |
541300 | 1.28e-07 | k__Bacteria; p__Acidobacteria; c__Solibacteres; o__Solibacterales; f__Solibacteraceae; g__CandidatusSolibacter; s__ |
563862 | 1.95e-07 | k__Bacteria; p__Acidobacteria; c__Solibacteres; o__Solibacterales; f__Solibacteraceae; g__CandidatusSolibacter; s__ |
In lieu of a solid statistical approach to compare these results, the results appear consistent across the different OTU picking workflows. The standard open-reference and subsampled open-reference are remarkably consistent.
To confirm that the new reference data set works as expected, I applied standard open-reference OTU picking on the original input sequences against the new reference collection generated by the subsampled OTU analysis. The idea here is that most reads should now hit the reference collection. A number of reads still fail, but on close investigation these turn out to all cluster into singleton OTUs. So, this is expected as singletons are not included in the reference collection (possible to adjust this with the --min_otu_size parameter [default = 2]). The new reference collection that is generated does appear to be sane. The command used for this analysis was:
pick_otus_through_otu_table.py -i /home/ubuntu/data/lauber_88soils/seqs.fna -o /home/ubuntu/data/lauber_88soils/subsample_ref_otus_eval/ucr97_v_new_ref/ -p /home/ubuntu/data/lauber_88soils/subsample_ref_otus_eval/ucr_v_newref_params.txt -aO 3
The parameters file (-p) for this analysis contained the following lines:
pick_otus:otu_picking_method uclust_ref
pick_otus:refseqs_fp /home/ubuntu/data/lauber_88soils/subsample_ref_otus_eval/prefilter60/new_refseqs.fna
pick_otus:enable_rev_strand_match True
This analysis is based on the data presented in the Costello Whole Body study.
Here I checked whether the subsampled reference OTU alpha diversities for all samples were correlated with the de novo OTU picking, standard open-reference OTU picking, and closed-reference OTU picking alpha diversities. The observed species/OTUs metric was calculated on add data sets (alpha_diversity.py -m observed_species), and the Pearson correlations were computed for subsampled reference OTU picking with the three other sets of values.
- subsampled open-reference OTU picking versus de novo OTU picking: r=0.99 p=0.0
- subsampled open-reference OTU picking versus standard open-reference OTU picking: r=1.0 p=0.0
- subsampled open-reference OTU picking versus closed-reference OTU picking: r=0.95 p=0.0
Subsampled open-reference OTU picking alpha diversity values are significantly correlated with de novo, standard open-reference, and closed-reference OTU picking results. This suggests that we will derive the same biological conclusions between regarding alpha diversity when using the subsampled OTU picking workflow.
Here I checked whether Procrustes comparisons of unweighted UniFrac PCoA plots between subsampled open-reference OTU picking and de novo OTU picking, standard open-reference OTU picking, and closed-reference OTU picking yield significant results. This was calculated using transform_coordinate_matrices.py which is described in the Procrustes tutorial. p-values are based on 1000 Monte Carlo iterations.
- subsampled open-reference OTU picking versus de novo OTU picking: M2=0.056 p<0.001
- subsampled open-reference OTU picking versus standard open-reference OTU picking: M2=0.053 p<0.001
- subsampled open-reference OTU picking versus closed-reference OTU picking: M2=0.072 p<0.001
Procrustes results are highly significant for the three comparisons, suggesting that we will derive the same biological conclusions regardless of which of these OTU picking workflows is used.
Here I confirm that the same taxonomy groups are identified as significantly different across the body sites in this study using ANOVA, regardless of which OTU picking workflow is applied. These results were computed with the otu_category_significance.py script. Note that OTU ids are not directly comparable across all analyses, so it is best to compare the taxonomies.
Top 5 OTUs that differ across bins for subsampled open-reference OTU picking:
OTU ID | Bonferroni-adjusted p-value | Taxonomy |
---|---|---|
470747 | 5.37e-151 | k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae |
471122 | 4.35e-143 | k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Prevotellaceae |
470477 | 4.14e-135 | k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Carnobacteriaceae |
94166 | 1.61e-125 | k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae |
442949 | 1.33e-124 | k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Veillonellaceae |
Top 5 OTUs that differ across bins for de novo OTU picking:
OTU ID | Bonferroni-adjusted p-value | Taxonomy |
---|---|---|
6389 | 4.54e-148 | k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae |
17234 | 1.16e-141 | k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Prevotellaceae |
18227 | 3.05e-139 | k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Carnobacteriaceae |
7262 | 1.22e-134 | k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Veillonellaceae |
18575 | 2.74e-122 | k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae |
Top 5 OTUs that differ across bins for standard open-reference OTU picking:
OTU ID | Bonferroni-adjusted p-value | Taxonomy |
---|---|---|
470747 | 5.37e-151 | k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae |
471122 | 4.35e-143 | k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Prevotellaceae |
470477 | 4.14e-135 | k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Carnobacteriaceae |
94166 | 1.61e-125 | k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae |
442949 | 1.33e-124 | k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Veillonellaceae |
Top 5 OTUs that differ across bins for closed-reference OTU picking:
OTU ID | Bonferroni-adjusted p-value | Taxonomy |
---|---|---|
470747 | 3.25e-150 | k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae |
471122 | 2.20e-148 | k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Prevotellaceae |
470477 | 2.51e-138 | k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Carnobacteriaceae |
94166 | 4.30e-130 | k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae |
442949 | 2.04e-121 | k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Veillonellaceae |
In lieu of a solid statistical approach to compare these results, the results are remarkably consistent across the different OTU picking workflows.
To investigate what reads get discarded at the prefilter stage, I evaluated a subset of the reads discarded when the prefilter was set to 80% (--prefilter_percent_id 0.80) versus when the prefilter was set to 60% (default).
These three have high percent id matches to 16S sequences in NCBI:
>F12Pinl.140479_129272 FFLHOYS02GCJLO orig_bc=ATACAGAGCTCC new_bc=ATACAGAGCTCC bc_diffs=0
CTGGGCCGTGTCTCAGTCCCAGTGTGGCTGATCATCCGAAAAGACCAGCTAAGCATCATTGGCTTGGTCAGCCTTTACCTAACCAACTACCTGATACTACGTGGGCTCATCGAACAGCGCGAATTAGCTTGCTTTATGAATTATTCAGGATTTGGAGTGAACTATTCGGCAGATTCCCACGCGTTACGCACCCGTTCGCCACTTTGCTTG
>F32Indr.140459_1174716 FFO92CG02IYZBA orig_bc=GCTATCACGAGT new_bc=GCTATCACGAGT bc_diffs=0
CCGGGCCGTGTCTCAGTCCCAGTGTGGCTGATCATCCGAAAAGACCAGCTAAGCATCATTGGCTTGGTCAGCCTTTACCTGACCAACTACCTAATACTACGCAGGCTCATCAAACAGCGCTTTTTAGCTTTCTTCAGGATTTGGCCCGAACTGTTCGGCAGATTCCCACGCGTTACGCACCCGTTCGCCACTTTGTTCTCAACTGTTCCCACCTCCTGGGCGAGA
>F32Forr.140528_1210712 FFO92CG02IKGYS orig_bc=GCGTTACACACA new_bc=GCGTTACACACA bc_diffs=0
CCGGGCCGTGTCTCAGTCCCAGTGTGGCTGATCATCCGAAAAGACCAGCTAAGCATCATTGGCTTGGTCAGCCTTTACCTGACCAACTACCTAATACTACGCAGGCTCATCAAACAGCGCTTTTGAGCTTTCTTCAGGATTTGGCCCGAACTGTTCGGCAGATTCCCACGCGTTACGCACCCGTTCGCCACTTTGTTCTCAACTATTCCGATTCTTTTTTCGGTAGGCCGTTA
These three reads hit a small fragment, a human sequence, and nothing in NCBI, respectively:
>M22Pinr.140692_1148864 FFO92CG01EQIWQ orig_bc=CGCACATGTTAT new_bc=CGCACATGTTAT bc_diffs=0
GGAAAAGGGAAAAACAGATGAGACAAATAGAAAACAAATAGCAAATTAGTAGGTGTTAACATGACTTTATCAATAATTACATCAAATGTAGATGATGTTAACCATGGATTGACAAACTTTTTCTTTATAGGACCAGACAGTCAATATTTTAGGTCTTTGAGGCCATATGGTATCTGTCATAACCACTCAACTGAGCCAGGATCAAACTCTGA
>F31Nstl.140789_1153834 FFO92CG02FSK33 orig_bc=GCAGCCGAGTAT new_bc=GCAGCCGAGTAT bc_diffs=0
CANNOT INCLUDE THIS READ DUE TO IRB RESTRICTIONS
>F32Nstl.140804_1160735 FFO92CG01BRQNZ orig_bc=GCTGCTGCAATA new_bc=GCTGCTGCAATA bc_diffs=0
CTGAAACCCTGGGTCACCAAAAGGCAGGAGGAGGAGGGACAGGGCAAGGCAGGGGAAGAGAGGGGAGGCTGACTCACATACACACATATGCATGCACACATCACACCCACATTCATGTACACACACACAGATTCACATGCATGCACAGCACAATCGCACACTTGTATACACACACAGGCACA
Based on this analysis (and currently unpublished data – will fill in when available), a threshold of 60% was chosen as the default value for discarding sequences that are likely not rRNA.
The subsampled open-reference OTU picking workflow can be run in iterative mode to support multiple different sequence collections, such as several HiSeq runs. In iterative mode, the list of sequence files will be processed in order, and the new reference sequences generated at each step will be used as the reference collection for the subsequent step. After all input collections have been processed a single OTU table and tree, covering all of the input collections, will be generated.
To apply this analysis to seqs1.fna and seqs2.fna in iterative mode, picking OTUs against the reference collection refseqs.fna you can run the following command.
To apply this analysis to seqs1.fna, picking OTUs against the reference collection refseqs.fna you can run the following command. Your parameters file should contain the following:
pick_otus:otu_picking_method uclust_ref
pick_otus:enable_rev_strand_match True
You should always use full paths which are represented here by $PWD, but will usually look something like /home/ubuntu/my_data/ (in other words, they should start with a /). In this example your input sequences (seqs1.fna), and your metadata mapping file (map.txt) are all in the same directory represented by $PWD. If you work from the directory containing those files, you can leave $PWD in the commands instead of specifying the full paths:
pick_subsampled_reference_otus_through_otu_table.py -i $PWD/seqs1.fna,$PWD/seqs2.fna -r $PWD/refseqs.fna -o $PWD/ucrss_iter/ -aO 8 -p $PWD/ucrss_params.txt
This command should be run in parallel. Each job will need approximately 4GB of RAM, so if running on EC2 and you want to start 8 parallel jobs (recommended setting for EC2), your instance type should be m2.4xlarge. The -aO 8 specifies that we want to start 8 parallel jobs - adjust this according to the resources you have available.
After iterative OTU picking you can continue on with beta diversity (and other) analyses as described here.