Description:
Demultiplexes sequences from an input FASTA file using barcodes and/or data from FASTA labels, as provided in a mapping file. By default, barcodes are removed from the sequences in the output FASTA file. If a quality scores file is supplied, it is truncated to match the output FASTA file. The default barcode type is the 12 base pair Golay code. Alternative barcode types are 8 base pair Hamming codes, variable_length, or generic barcodes of a specified length. Generic barcodes use mismatch counts for correction. An added demultiplex field (-j option) can also specify data in the FASTA labels to be used alone or in conjunction with barcode sequences for demultiplexing. All barcode correction is disabled when variable-length barcodes are used.
Usage: demultiplex_fasta.py [options]
Input Arguments:
Note
[REQUIRED]
[OPTIONAL]
Output:
Four files can be generated by demultiplex_fasta.py
Standard Example:
For a single 454 run with one FASTA file, one QUAL file, and one mapping file, using default parameters and writing the output to the directory “demultiplexed_output”:
demultiplex_fasta.py -m Mapping_File_golay.txt -f 1.TCA.454Reads.fna -q 1.TCA.454Reads.qual -o demultiplexed_output/
When there are multiple FASTA and QUAL files, the user can run the following command, as long as the mapping file lists no duplicate barcodes:
demultiplex_fasta.py -m Mapping_File_golay.txt -f 1.TCA.454Reads.fna,2.TCA.454Reads.fna -q 1.TCA.454Reads.qual,2.TCA.454Reads.qual -o demultiplexed_output_comma_separated/
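Because the multi-file command above depends on the mapping file containing no duplicate barcodes, it can help to check this before running. The following is a minimal sketch, not part of demultiplex_fasta.py. It assumes the mapping file is tab-delimited, that its header is the first line starting with "#", that the sample ID is in the first column, and that the barcode column is named "BarcodeSequence". The function name `duplicate_barcodes` is hypothetical.

```python
from collections import defaultdict


def duplicate_barcodes(mapping_lines, barcode_column="BarcodeSequence"):
    """Return {barcode: [sample IDs]} for barcodes listed more than once.

    Assumes a tab-delimited mapping file whose header is the first line
    starting with '#', with the sample ID in the first column.
    """
    header = None
    samples_by_barcode = defaultdict(list)
    for line in mapping_lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            if header is None:
                # The first '#' line is treated as the header.
                header = line.lstrip("#").split("\t")
            continue  # later '#' lines are treated as comments
        if header is None:
            raise ValueError("mapping file has no header line")
        fields = line.split("\t")
        barcode = fields[header.index(barcode_column)].upper()
        samples_by_barcode[barcode].append(fields[0])
    return {bc: ids for bc, ids in samples_by_barcode.items() if len(ids) > 1}
```

To check a file on disk, pass an open file handle, for example `duplicate_barcodes(open("Mapping_File_golay.txt"))`. An empty result means no barcode is repeated.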
Duplicate Barcode Example:
An example of this situation would be a study with 1200 samples. You wish to have 400 samples per run, so you split the analysis into three runs and reuse barcodes (you only have 600). After initial analysis you determine that a small subset is underrepresented (<500 sequences per sample), and you boost the number of sequences per sample for this subset by running a fourth run. Since the same sample IDs are in more than one run, it is likely that some sequences will be assigned the same unique identifier by demultiplex_fasta.py when it is run separately on the four different runs, each with its own barcode file. This will cause a problem when concatenating the output of the four runs into a single large file. To avoid this, you can use the ‘-n’ parameter, which defines a start index for demultiplex_fasta.py fasta label enumeration. From experience, most 454 runs (when combining both files for a single plate) will have 350,000 to 650,000 sequences. Thus, if Run 1 uses ‘-n 1000000’, Run 2 uses ‘-n 2000000’, etc., then you are guaranteed to have unique identifiers after concatenating the results of multiple 454 runs. With newer technologies you will just need to make sure that your start index spacing is greater than the potential number of sequences.
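The spacing rule above can be sketched as a small helper that picks one ‘-n’ value per run and refuses a spacing that is too small. This is an illustrative sketch only: the function name `plan_start_indices` is hypothetical, and it assumes each sequence consumes one consecutive index starting at the ‘-n’ value.

```python
def plan_start_indices(run_sizes, spacing=1000000):
    """Return one start index (-n value) per run.

    run_sizes: expected or upper-bound number of sequences in each run.
    spacing:   gap between consecutive start indices; it must be greater
               than the largest run so that label ranges cannot overlap.
    """
    if not run_sizes:
        return []
    if max(run_sizes) >= spacing:
        raise ValueError(
            "spacing must be greater than the largest number of sequences per run"
        )
    # Run 1 starts at spacing, Run 2 at 2 * spacing, and so on.
    return [spacing * (i + 1) for i in range(len(run_sizes))]
```

For four runs of at most 650,000 sequences each, `plan_start_indices([650000] * 4)` gives 1000000, 2000000, 3000000, and 4000000, matching the ‘-n’ values suggested above.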
To run demultiplex_fasta.py in this situation, you need two or more separate mapping files, one per run (the number depends on how many times the barcodes were reused), for example one for Run1 and another for Run2. You can then run demultiplex_fasta.py on the FASTA and mapping file for Run1, and again on the FASTA and mapping file for Run2. Once you have independently run demultiplex_fasta.py on each file, followed by quality filtering, you can concatenate (cat) the sequence files generated. You can also concatenate the mapping files, since the barcodes are not necessary for downstream analyses, unless the same sample IDs are found in multiple mapping files.
Run demultiplex_fasta on Run 1:
demultiplex_fasta.py -m Mapping_File1.txt -f 1.TCA.454Reads.fna -q 1.TCA.454Reads.qual -o demultiplexed_output_Run1/ -n 1000000
Run demultiplex_fasta on Run 2:
demultiplex_fasta.py -m Mapping_File2.txt -f 2.TCA.454Reads.fna -q 2.TCA.454Reads.qual -o demultiplexed_output_Run2/ -n 2000000
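Before concatenating the mapping files from several runs, it is worth checking the caveat noted in this example: whether the same sample IDs occur in more than one mapping file. The sketch below is not part of demultiplex_fasta.py. It assumes tab-delimited mapping files in which lines starting with "#" are header or comment lines and the sample ID is the first column. The function names are hypothetical.

```python
def sample_ids(mapping_lines):
    """Collect sample IDs (first tab-delimited column) from mapping file lines."""
    ids = []
    for line in mapping_lines:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank, header, and comment lines
        ids.append(line.split("\t")[0].strip())
    return ids


def shared_sample_ids(*mappings):
    """Return the sorted sample IDs that occur in more than one mapping file."""
    seen = set()
    shared = set()
    for lines in mappings:
        # Use a set so repeats within one file are not reported as shared.
        for sample_id in set(sample_ids(lines)):
            if sample_id in seen:
                shared.add(sample_id)
            else:
                seen.add(sample_id)
    return sorted(shared)
```

For example, `shared_sample_ids(open("Mapping_File1.txt"), open("Mapping_File2.txt"))` lists the sample IDs present in both files. If the list is empty, the mapping files do not share sample IDs.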
Barcode Decoding Example:
The standard barcode types supported by demultiplex_fasta.py are golay (Length: 12 NTs) and hamming (Length: 8 NTs). When the barcodes have a different length than golay or hamming barcodes, the user can define a generic barcode type by passing an integer to “-b”, where the integer is the length of the barcode used in the study.
If generic 8 base pair barcodes were used, you can use the following command:
demultiplex_fasta.py -m Mapping_File_8bp_barcodes.txt -f 1.TCA.454Reads.fna -q 1.TCA.454Reads.qual -o demultiplexed_output_8bp_barcodes/ -b 8
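The description states that generic barcodes use mismatch counts for correction. The sketch below illustrates that general idea: an observed barcode is assigned to the known barcode with the fewest mismatches, provided the match is unambiguous and within a threshold. It is an assumption-laden illustration, not the implementation used by demultiplex_fasta.py. The function names, the `max_mismatches` parameter, and the tie-handling rule are all hypothetical.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def correct_generic_barcode(observed, known_barcodes, max_mismatches=1):
    """Return (barcode, mismatches) for the closest known barcode.

    Returns (None, mismatches) when the closest barcode is too far away
    or when two known barcodes are equally close (ambiguous).
    """
    if not known_barcodes:
        raise ValueError("no known barcodes supplied")
    scored = sorted((count_mismatches(observed, bc), bc) for bc in known_barcodes)
    best_distance, best_barcode = scored[0]
    ambiguous = len(scored) > 1 and scored[1][0] == best_distance
    if best_distance > max_mismatches or ambiguous:
        return None, best_distance
    return best_barcode, best_distance
```

With known barcodes `["ACGTACGT", "TTGGCCAA"]`, the observed barcode `"ACGTACGA"` differs from the first by one base and is assigned to it, while `"GGGGGGGG"` is equally far from both and is left unassigned.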
To demultiplex using the run prefix at the beginning of the fasta label, the mapping file must contain a field labeled “run_prefix”, which is then passed to the -j option as in the following command:
demultiplex_fasta.py -m Mapping_File_run_prefix.txt -f 1.TCA.454Reads.fna -q 1.TCA.454Reads.qual -o demultiplexed_output_run_prefix/ -j run_prefix