
demultiplex_fasta.py – Demultiplex fasta data according to barcode sequences or data supplied in fasta labels.

Description:

Demultiplexes sequences from an input fasta file using barcodes and/or data from the fasta labels, as specified in a mapping file. By default, barcodes are removed from the sequences in the output fasta file. If a quality scores file is supplied, its scores are truncated to match the output fasta file. The default barcode type is 12 base pair Golay codes. Alternative barcode types are 8 base pair Hamming codes, variable_length, or generic barcodes of a specified length. Generic barcodes use mismatch counts for correction. An added demultiplex field (-j option) can specify data in the fasta labels to be used alone or in conjunction with barcode sequences for demultiplexing. All barcode correction is disabled when variable length barcodes are used.
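As a rough illustration of barcode removal and the matching quality score truncation, the sketch below trims a fixed-length barcode from the start of a read and drops the same number of quality scores. It is not the script's own code: the function name strip_barcode, its parameters, and the assumption that the barcode occupies the first bases of the read are all hypothetical.

```python
def strip_barcode(seq, quals, barcode_len, keep_barcode=False):
    """Split a read into (barcode, sequence, quality scores).

    Assumes the barcode occupies the first `barcode_len` bases of the read.
    With keep_barcode=True the read and its scores are returned untrimmed.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    barcode = seq[:barcode_len]
    if keep_barcode:
        return barcode, seq, quals
    # Trim the same number of positions from both so they stay aligned.
    return barcode, seq[barcode_len:], quals[barcode_len:]


# Hypothetical 18-base read with a 12-base barcode and dummy quality scores.
barcode, seq, quals = strip_barcode("AACCGGTTAACCTTTGGG", list(range(18)), 12)
print(barcode, seq, quals)
```

Trimming both the sequence and the scores in one place keeps the two outputs the same length, which is the property the truncated quality file is meant to preserve.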

Usage: demultiplex_fasta.py [options]

Input Arguments:

[REQUIRED]

-m, --map
Name of mapping file. NOTE: Must contain a header line with SampleID in the first column, BarcodeSequence in the second, and LinkerPrimerSequence in the third.
-f, --fasta
Names of fasta files, comma-delimited
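The header requirement for the mapping file can be checked before running the script. The sketch below is a minimal example, assuming a tab-delimited mapping file whose header line may begin with “#”. The function name check_mapping_header and the extra Description column in the example are hypothetical.

```python
REQUIRED_HEADERS = ["SampleID", "BarcodeSequence", "LinkerPrimerSequence"]


def check_mapping_header(header_line):
    """Return True if the first three columns match the required headers.

    Assumes tab-delimited columns and tolerates a leading '#' on the line.
    """
    fields = header_line.rstrip("\n").lstrip("#").split("\t")
    return fields[:3] == REQUIRED_HEADERS


# Hypothetical header lines: the first is valid, the second swaps two columns.
print(check_mapping_header("#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tDescription\n"))
print(check_mapping_header("#SampleID\tLinkerPrimerSequence\tBarcodeSequence\n"))
```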

[OPTIONAL]

-q, --qual
File paths of qual files, comma-delimited [default: None]
-B, --keep_barcode
Do not remove barcode from sequences
-b, --barcode_type
Barcode type, hamming_8, golay_12, variable_length (will disable any barcode correction if variable_length set), or a number representing the length of the barcode, such as -b 4. The max barcode errors (-e) should be lowered for short barcodes. [default: golay_12]
-o, --dir_prefix
Directory prefix for output files [default: .]
-e, --max_barcode_errors
Maximum number of errors in barcode. If using generic barcodes every 0.5 specified counts as a primer mismatch. [default: 1.5]
-n, --start-numbering-at
Seq id to use for the first sequence [default: 1]
--retain_unassigned_reads
Retain sequences which cannot be demultiplexed in a separate output sequence file [default: False]
-c, --disable_bc_correction
Disable attempts to find nearest corrected barcode. Can improve performance. [default: False]
-F, --save_barcode_frequencies
Save frequencies of barcodes as they appear in the given sequences, sorted from most to least frequent. Does nothing if the barcode type is 0 or variable_length. [default: False]
-j, --added_demultiplex_field
Use -j to name a field in the mapping file that serves as an additional demultiplexing option alongside the barcode. All combinations of barcodes and the values in these fields must be unique. The fields must contain values that can be parsed from the fasta labels, such as “plate=R_2008_12_09”. In this case, “plate” would be the column header and “R_2008_12_09” would be the field data (minus quotes) in the mapping file. To use the run prefix from the fasta label, such as “>FLP3FBN01ELBSX”, where “FLP3FBN01” is generated from the run ID, use “-j run_prefix” and enter the run prefix as the data under the column header “run_prefix”. [default: None]
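To make the -j description more concrete, the sketch below shows one way key=value data and a run prefix could be read from a fasta label. It only illustrates the idea and is not how demultiplex_fasta.py parses labels. The function names label_fields and has_run_prefix, the whitespace-separated token layout, and the “length=250” token in the example are assumptions.

```python
def label_fields(label):
    """Collect key=value tokens from a fasta label line into a dict.

    Assumes tokens are separated by whitespace.
    """
    fields = {}
    for token in label.lstrip(">").split():
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
    return fields


def has_run_prefix(label, run_prefix):
    """Return True if the label (minus '>') starts with the run prefix."""
    return label.lstrip(">").startswith(run_prefix)


# Hypothetical label combining the two examples from the option description.
label = ">FLP3FBN01ELBSX length=250 plate=R_2008_12_09"
print(label_fields(label)["plate"])
print(has_run_prefix(label, "FLP3FBN01"))
```

The value looked up under “plate” is what would have to appear in the “plate” column of the mapping file for the read to be assigned.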

Output:

Four files can be generated by demultiplex_fasta.py:

  1. seqs.fna - Contains the fasta sequences, demultiplexed according to barcodes and/or the added demultiplex field.
  2. demultiplexed_sequences.log - Contains details about demultiplexing stats.
  3. seqs.qual - If quality score file(s) are supplied, the scores are truncated to match the seqs.fna file after barcode removal, if barcode removal is enabled.
  4. seqs_not_assigned.fna - If --retain_unassigned_reads is enabled, all sequences that cannot be demultiplexed are written to this file. A seqs_not_assigned.qual file is also created if a quality file is supplied.

Standard Example:

For a single 454 run with one FASTA file, one QUAL file, and one mapping file, using default parameters and writing the output to the directory “demultiplexed_output”:

demultiplex_fasta.py -m Mapping_File_golay.txt -f 1.TCA.454Reads.fna -q 1.TCA.454Reads.qual -o demultiplexed_output/

When there are multiple FASTA and QUAL files, the user can run the following command as long as no duplicate barcodes are listed in the mapping file:

demultiplex_fasta.py -m Mapping_File_golay.txt -f 1.TCA.454Reads.fna,2.TCA.454Reads.fna -q 1.TCA.454Reads.qual,2.TCA.454Reads.qual -o demultiplexed_output_comma_separated/

Duplicate Barcode Example:

An example of this situation would be a study with 1200 samples. You wish to have 400 samples per run, so you split the analysis into three runs and reuse barcodes (you only have 600). After initial analysis you determine that a small subset is underrepresented (<500 sequences per sample), and you boost the number of sequences per sample for this subset with a fourth run. Since the same sample IDs are in more than one run, it is likely that some sequences will be assigned the same unique identifier by demultiplex_fasta.py when it is run separately on the four different runs, each with its own barcode file. This causes a problem when the four runs are concatenated into a single large file. To avoid this, you can use the ‘-n’ parameter, which defines a start index for demultiplex_fasta.py fasta label enumeration. From experience, most 454 runs (when combining both files for a single plate) will have 350,000 to 650,000 sequences. Thus, if Run 1 for demultiplex_fasta.py uses ‘-n 1000000’, Run 2 uses ‘-n 2000000’, etc., the identifiers remain unique after concatenating the results of multiple 454 runs, provided no run contains more sequences than the spacing between start indices. With newer technologies you will just need to make sure that your start index spacing is greater than the potential number of sequences.
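The spacing rule for ‘-n’ can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name start_indices and the run sizes in the example are hypothetical, and the default spacing of 1,000,000 mirrors the values suggested above.

```python
def start_indices(run_sizes, spacing=1_000_000):
    """Return one '-n' start index per run, spaced `spacing` apart.

    Raises ValueError if any run is too large for the spacing, since its
    identifiers could then run into the next run's range.
    """
    largest = max(run_sizes)
    if largest >= spacing:
        raise ValueError(
            "largest run has %d sequences; spacing must be greater" % largest
        )
    return [spacing * (i + 1) for i in range(len(run_sizes))]


# Hypothetical sequence counts for four runs.
print(start_indices([350_000, 650_000, 500_000, 120_000]))
# -> [1000000, 2000000, 3000000, 4000000]
```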

To run demultiplex_fasta.py in this situation, you will need two or more separate mapping files, depending on the number of times the barcodes were reused (one for each run, for example one for Run1 and another for Run2). You then run demultiplex_fasta.py once with the FASTA and mapping file for Run1 and once with the FASTA and mapping file for Run2. Once you have independently run demultiplex_fasta.py on each file, followed by quality filtering, you can concatenate (cat) the sequence files generated. You can also concatenate the mapping files, since the barcodes are not necessary for downstream analyses, unless the same sample IDs are found in multiple mapping files.

Run demultiplex_fasta.py on Run 1:

demultiplex_fasta.py -m Mapping_File1.txt -f 1.TCA.454Reads.fna -q 1.TCA.454Reads.qual -o demultiplexed_output_Run1/ -n 1000000

Run demultiplex_fasta.py on Run 2:

demultiplex_fasta.py -m Mapping_File2.txt -f 2.TCA.454Reads.fna -q 2.TCA.454Reads.qual -o demultiplexed_output_Run2/ -n 2000000
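After the per-run outputs have been produced and quality filtered, they can be concatenated with cat. The sketch below does the same in Python while also confirming that no sequence identifier appears twice, which is the problem the ‘-n’ offsets are meant to prevent. It assumes that the identifier is the first whitespace-delimited token of each fasta label. The function name concatenate_fasta, the file names, and the sample reads are hypothetical.

```python
import os
import tempfile


def concatenate_fasta(paths, out_path):
    """Concatenate fasta files and return the number of sequences written.

    Raises ValueError if the same sequence identifier appears twice.
    """
    seen = set()
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        parts = line[1:].split()
                        seq_id = parts[0] if parts else ""
                        if seq_id in seen:
                            raise ValueError("duplicate identifier: " + seq_id)
                        seen.add(seq_id)
                    # Guard against a missing newline at the end of a file.
                    if not line.endswith("\n"):
                        line += "\n"
                    out.write(line)
    return len(seen)


# Demo with two tiny stand-in files (hypothetical sample IDs and reads).
with tempfile.TemporaryDirectory() as tmp:
    run1 = os.path.join(tmp, "run1_seqs.fna")
    run2 = os.path.join(tmp, "run2_seqs.fna")
    with open(run1, "w") as fh:
        fh.write(">S1_1000000\nACGT\n")
    with open(run2, "w") as fh:
        fh.write(">S1_2000000\nTTGA\n")
    combined = os.path.join(tmp, "combined_seqs.fna")
    total = concatenate_fasta([run1, run2], combined)
print(total)
```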

Barcode Decoding Example:

The standard barcode types supported by demultiplex_fasta.py are golay (Length: 12 NTs) and hamming (Length: 8 NTs). For situations where the barcodes are of a different length than golay and hamming, the user can define a generic barcode type “-b” as an integer, where the integer is the length of the barcode used in the study.

When generic 8 base pair barcodes were used, you can use the following command:

demultiplex_fasta.py -m Mapping_File_8bp_barcodes.txt -f 1.TCA.454Reads.fna -q 1.TCA.454Reads.qual -o demultiplexed_output_8bp_barcodes/ -b 8
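Generic barcodes are corrected using mismatch counts. The sketch below shows the general idea of a nearest-barcode lookup by mismatch count. It is not the script's implementation, and the max_mismatches parameter is a hypothetical threshold that does not correspond directly to the -e option. The barcodes in the example are made up.

```python
def mismatches(a, b):
    """Count positions at which two equal-length barcodes differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_barcode(observed, known_barcodes, max_mismatches=1):
    """Return the single closest known barcode, or None.

    None is returned when the closest barcode has too many mismatches
    or when two barcodes tie for closest (ambiguous assignment).
    """
    scored = sorted(
        (mismatches(observed, bc), bc)
        for bc in known_barcodes
        if len(bc) == len(observed)
    )
    if not scored:
        return None
    best_dist, best = scored[0]
    if best_dist > max_mismatches:
        return None
    if len(scored) > 1 and scored[1][0] == best_dist:
        return None
    return best


# Hypothetical 8 base pair barcodes; the observed read has one mismatch.
known = ["ACGTACGT", "TTTTCCCC"]
print(nearest_barcode("ACGTACGA", known))
print(nearest_barcode("GGGGGGGG", known))
```

Rejecting ties is a design choice for this sketch: a read that is equally close to two barcodes cannot be assigned to either sample with confidence.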

To demultiplex using the run prefix at the beginning of the fasta label, the mapping file must contain a column labeled “run_prefix”. The following command then uses that column:

demultiplex_fasta.py -m Mapping_File_run_prefix.txt -f 1.TCA.454Reads.fna -q 1.TCA.454Reads.qual -o demultiplexed_output_run_prefix/ -j run_prefix
