News and Announcements » |
The pyrosequencing technology employed by 454 sequencing machines produces characteristic sequencing errors, mostly imprecise signals for longer homopolymers runs. Most of the sequences contain none or only a few errors, but a few sequences contain enough errors to be classified as an additional rare OTU. The goal for the denoising procedure is to reduce the amount of erroneous OTUs and thus increasing the accuracy of the whole QIIME pipeline.
If there are multiple, large 454 runs, follow this tutorial to denoise the data set and analyze it with QIIME. In short, each 454 run needs to be preprocessed with split_libraries.py and denoised separately. Afterwards the output files are combined for OTU picking. We will show an example with two 454 runs (run1.sff and run2.sff).
Data preparation:
From the raw, binary sff file, three files need to be generated for each run with the sffinfo tool from 454. You should have this tool if you have a 454 sequencer. Otherwise ask the sequencing facility for the files:
sffinfo run_1.sff > run_1.sff.txt
sffinfo -s run_1.sff > run_1.fasta
sffinfo -q run_1.sff > run_1.qual
sffinfo run_2.sff > run_2.sff.txt
sffinfo -s run_2.sff > run_2.fasta
sffinfo -q run_2.sff > run_2.qual
Note that the qiime since package v1.2 has a replacement for the sfftools. It’s slower but fully functional.
Warning
Warning: Since late 2012, 454 machines have a new feature (flow pattern B) that is supposed to allow for longer reads. Unfortunately, files using this feature can not be denoised, but result in nonsense output. To make sure that your file uses the older, more common flow pattern A, open the .sff.txt file and look for the Flow Chars: section in the header. If it shows a constant repeat of TACG you are fine. If however the pattern deviates after the third repeat, you are looking at the new flow pattern B that can not be denoised. In any case, all other qiime programs are not affected by this and can be used as usual.
For more details on the available options of each script explained in the following use the -h option.
Quality filtering and barcode assignment:
Prior to denoising, each read has to be assigned to one barcode/sample and low quality reads need to be filtered out. This can be done using split_libraries.py. An example command would be:
split_libraries.py -o run1 -f run1.fasta -q run1.qual -m run1_mapping.txt -w 50 -g -r -l 150 -L 350
split_libraries.py -o run2 -f run2.fasta -q run2.qual -m run2_mapping.txt -w 50 -g -r -l 150 -L 350 -n 1000000
This step has to be done separately for each 454 pool, following the usual guidelines for running several data sets through split_libraries.py. Note that all options to split_libraries.py that truncate the sequences on the 3’ end should not be used as they do not affect the sff.txt files used for denoising. This includes the -x, -z truncate_only, and -w without -g options. We recommend though to use the -w 50 -g combination to discard reads of bad quality. Also, do not use the truncate_fasta_qual_files.py script if you plan to denoise your data. If you need to truncate your data, use the sfffile program from the Roche sfftools package and recreate your fasta and qual files from the truncated sff file.
For a single, non-barcoded sample, split_libraries.py can be provided with a mapping file that has an empty field for the BarcodeSequence.
Example:
Note
Note that fields must be separated by a single tab. For the empty barcode there must be two tabs between SampleID and the primer sequence. Use QIIME’s validate_mapping_file.py to test for validity. Then, use split_libraries.py as usual, but with option -b 0.
Flowgram clustering (aka denoising)
Each run will be denoised using its quality filtered output of split_libraries.py and the initial .sff.txt file. All flowgrams without a match in the provided split_libraries.py FASTA file are removed. The sequencing primer will be extracted from the metadata mapping file:
denoise_wrapper.py -v -i run1.sff.txt -f run1/seqs.fna -o run1/denoised/ -m run1_mapping.txt
denoise_wrapper.py -v -i run2.sff.txt -f run2/seqs.fna -o run2/denoised/ -m run2_mapping.txt
Denoising large data sets is computationally demanding. While smaller data sets (< 50,000 sequences) can be run on one single machine within an hour, a typical 454 run with 400,000 sequences after quality filtering requires up to a day on a 24 core cluster. If the denoiser is set up properly on your cluster or multi-core machine, it can be started in parallel mode using the option -n:
denoise_wrapper.py -v -i run1.sff.txt -f run1/seqs.fna -o run1/denoised/ -m run1_mapping.txt -n 24
The output files of this step is stored in directory run1/ and run2/, respectively:
- denoiser.log: Information about the clustering procedure if run in verbose mode (-v). Can be used to monitor the program’s progress.
- centroids.fasta: The centroids of clusters with 2 and more members.
- singletons.fasta: Reads that could not be clustered.
- denoiser_mapping.txt: The cluster to read mapping.
- denoised_clusters.txt: A cluster mapping in qiime format. Equivalent to 4.
- denoised_seqs.fasta: Centroids and singletons combined and sorted by cluster size.
Usually the centroid and singleton files are combined for downstream analysis, but occasionally it might make sense to remove the low confidence singletons. 2, 3, and 4 are used as input to the next step.
Re-integrating the denoised data into QIIME
The final step in a denoising run usually is the re-integration of the data into the QIIME pipeline. Since the denoiser uses flowgram similarity for clustering there is no guaranteed sequence (dis)-similarity between cluster centroids. In order to create the usual species-level OTUs at 97% sequence similarity, you must inflate the denoiser results and then run one of QIIME’s OTU pickers on the combined denoiser output.
Inflating denoiser results refers to process of creating a new fasta file of denoised sequences where each centroid sequence is written n times, where n is the cluster size, and each singleton is written once. Flowgram identifiers are mapped to sequence identifiers using the original input file.
To inflate the results of a single denoiser run call:
inflate_denoiser_output.py -c centroids.fna -s singletons.fna -f seqs.fna -d denoiser_mapping.txt -o denoised_seqs.fna
To inflate the results from independent denoise_wrapper.py runs, pass all of the centroid, singleton, input fasta files, and denoiser maps:
inflate_denoiser_output.py -c centroids1.fna,centroids2.fna -s singletons1.fna,singletons2.fna -f seqs1.fna,seqs2.fna -d denoiser_mapping1.txt,denoiser_mapping2.txt -o denoised_seqs.fna
Your denoised sequences can now be fed directly into QIIME at the OTU picking stage. The next step will be to run one of the OTU pickers or OTU picking workflow scripts (e.g., pick_otus.py, pick_de_novo_otus.py, pick_closed_reference_otus.py, or pick_open_reference_otus.py,. At the OTU picking stage it is very important that you allow for the abundance presorting, which is currently in place for the uclust OTU picker only. We therefore don’t recommend using other OTU pickers, and do not pass the -D/–suppress_presort_by_abundance_uclust option to pick_otus.py. If possible, it is worth using uclust with --optimal to assure the best possible choice of OTUs.:
pick_otus.py -s 0.97 -i denoised_seqs.fna -m uclust --optimal
Passing --optimal may be prohibitively compute-intensive for large analyses however (for example, greater than a single 454 FLX run). The default QIIME pick_otus.py parameters are likely to be sufficient.
Notes:
Low-level Interface
denoise_wrapper.py provides an easy to use interface to the denoiser, which is sufficient in most cases. For power users, we also provide two low level scripts, that allow for more flexibility.
Cluster phase 1 - prefix clustering
All flowgrams corresponding to the sequences that are in seqs.fna (presumed to be the output of split_libraries.py) are pulled from the .sff.txt file and primer, barcodes and the 454 key sequence are removed. Then, the first clustering phase groups reads based on common prefixes. For a full FLX run this will usually take less than an hour on a standard computer and requires less than 1 GB of memory.
Example command:
denoiser_preprocess.py -i 454Reads.sff.txt -f seqs.fna -o example_pp -s -v -p CATGCTGCCTCCCGTAGGAGT
Several files are stored in the specified output directory. To see the clustering stastics check the file preprocess.log in the output directory. Basically the less clusters there are (especially small clusters) the faster the next phase will run. If there are more than 100.000 sequences remaining, the input set might be split, to achieve a reasonable run time. The files in the output directory are used in the next step.
Cluster phase II - Flowgram clustering or Denoising
This is the main clustering step and the computationally most expensive one. Flowgrams are clustered based on their similarity.
Example command:
denoiser.py -i 454Reads.sff.txt -p example_pp -v -o example_denoised
The preprocessing information in example_pp is used and the output is stored in a randomly named, new direcory in example_denoised. Note, that when the -p option is not specified here, the preprocessing is invoked from denoiser.py implicitly.
Because of the potential long runtime, we suggest to distribute the work over many cpus. If you have a multi-core system or cluster available and set up the required job submission script (cluster_jobs_fp in your qiime config) the following command will distribute the computation over 24 cpus:
denoiser.py -i 454Reads.sff.txt -p example_pp -v -o example_denoised -c -n 24
Make sure the output directory is shared by all cluster nodes. Depending on the complexity of the data this step might take up to a day even on a 24 core system for a full 454 run with 400-500 k sequences. Smaller data sets will be finished much faster. The output will be written to a randomly named directory within the specified output directory. The output files are:
Can be used to monitor the program’s progress.
centroids.fasta: The centroids of clusters with 2 and more members
singletons.fasta: Reads that could not be clustered.
denoiser_mapping.txt: The cluster to read mapping.
Usually the centroid and singleton files are combined for downstream analysis, but occasionally it might make sense to remove the low confidence singletons.
Notes for running on cluster/multicore system
We use a very simple setup to farm out the flowgram alignments to a cluster. A master process (denoiser.py) sends data to each worker (denoiser_worker.py). A worker sleeps while waiting for the data. Once the file appears it processes it and sends the result back to the master and goes back to sleep. The master collects all results and iterates. As such, performance is higly dependent on the actual cluster setup:
The overall speed is governed by the slowest worker node
as one jobs remains queued, the other jobs will block your cluster. Decrease the number of workers if you run into this problem.
FAQ
Q: How does this denoising procedure differ from PyroNoise?
Q: What is the expected run-time?
Q: Can I denoise Titanium data
Q: How can I speed up the computation?
Q: Why are there so few sequences in my output file after denoising? Did something went wrong with my sequencing run?
Q: So where are all the sequences then?
Q: Can I cluster at different sequence/flowgram similarity thresholds?
Q: Denoising on the clusters “hangs” after a while. What is going on?
Q: How and why can I run the preprocessing step separately?
Q: What about different next-gen sequencing platforms?
—
Q: How does this denoising procedure differ from PyroNoise?
A: PyroNoise uses an expectation maximization (EM) algorithm to figure out the most likely sequence for every read. We, instead, use a greedy scheme that can be seen as an approximation to PyroNoise. According to several test data sets, our approximation gives very similar results in a fraction of the time.
Q: What is the expected run-time?
A: The whole heuristic for our method depends on the actual species distribution in your samples. An ideal data set has few species and a very skewed abundance distribution with a few, very abundant species. With more species and a flatter abundance distribution run time increases. You can get a rough estimate of the run time after the preprocessing step by looking at the number of reads printed in the log file in verbose mode. Very, very roughly, compute time increases quadratically with the number of reads after preprocessing:
Note
If the number of remaining sequences is smaller than 50.000, you can expect <24 hours on 20 cpus. With 100k seqs you would need 80 cpus to expect it to finish within a day.
Here are some guidelines from runs with actual data:
Q: Can I denoise Titanium data?
A: Yes. The algorithm can process Titanium data and we have done it several times. As of (denoiser) version 0.9/Qiime-1.2 we ship an error profile for the titanium platform with the package. Use the switch –titanium to enable the new profile. Be aware that Titanium still takes considerably longer than FLX.
Q: How can I speed up the computation?
A: 1. Use more CPUs if available.
Q: Why are there so few sequences in my output file after denoising? Did something went wrong with my sequencing run?
A: No, this is expected. The denoising procedure (and this also holds for Chris Quince’s Pyronoise) technically do not remove any reads from the input set. This is the task of the initial quality filtering, which we suggest to do using Qiime’s split_libraries.py. The denoising is basically a clustering approach on the flowgram level, i.e. all reads that look similar enough on the flowgram level are clustered and only the centroid of each cluster is reported in the output file (either in centroids.fasta if the cluster has more than one member or otherwise in singletons.fasta). You can think of the centroids as OTUs on the flowgram level. Since flowgram similarity does not correlate perfectly with sequence-similarity, we usually don’t call them OTUs, but only after an extra OTU picking step with, say, cd-hit or uclust on the denoised sequences.
Q: So where are all the sequences then?
A: If you look at the file denoiser_mapping.txt, e.g. like this:
wc denoiser_mapping.txt
you should see that the number in the middle of the output (i.e. the number of words) is about the number of sequences in your input set. (Sometimes, the denoiser discards a few additional reads due to quality issues that were not captured by split_libraries.py). All reads that are in this mapping file can and will be used e.g. in the downstream Qiime analysis. The first number in the wc output gives the number of lines on the files, which corresponds to the number of clusters after denoising.
Q: Can I cluster at different sequence/flowgram similarity thresholds?
A: Basically, Yes. The default clustering parameters are set and tested to work well at 0.97% sequence similarity. If you want to cluster at, say, 0.95% you have to increase both cut-offs and decrease the percent_ID:
The low_cut_off and the percent_id are used for clustering in the second, greedy clustering step. The high_cut_off is used in the third clustering step, where unclustered reads are mapped according to their best match to any of the clusters of phase II. For good values for the thresholds, we refer to the plot S2 in the supplementary material of the denoiser paper (Reeder and Knight, Nature Methods 2010).
Q: Denoising on the clusters “hangs” after a while. What is going on?
A: If not provided with already preprocessed data via the -p option, the denoiser.py script automatically starts the preprocessing phase (cluster phase I in the paper) on one CPU on the cluster. This preprocessing takes from a few minutes for partial GS FLX runs to an hour or more for large Titanium runs. After this step, the parallel cluster phase II starts. First, all requested workers are started one-by-one. Depending on your queueing system and the number of jobs this might take from few seconds to several minutes. If one or more of the jobs are not started by the queueing system, all submitted jobs will block and wait. This is most likely the state your process is in if nothing seems to happen. We know this is not optimally and already thinking about a better solution for the future. In the meantime, make sure you only request as many jobs as you can safely run in your queue and monitor (qstat) the startup phase to see if all jobs are properly scheduled. If you finf that you requested to many CPUs and need to restart, simply kill the master process (denoiser.py) and it should bring down all but the last submitted jobs. The last job might need to be killed by hand. Once all workers are succesfully started, you can monitor the progress by following the log file in verbose mode (toggled by the -v option):
tail -f denoiser.log
Q: How and why can I run the preprocessing step separately?
A: If you call denoiser.py without the -p option (or via its wrapper denoise_wrapper.py in QIIME) the preprocessing step (cluster phase I) is implicitly called. You can explicitly run the preprocessing step via the script preprocess.py and provide the output directory to denoiser.py using the -p option. Reasons for running the steps separately could be:
Q: What about different next-gen sequencing platforms?
A: Denoising in this form only applies to 454 based pyrosequencing.