join_paired_ends.py – Joins paired-end Illumina reads.
Description:
This script takes forward and reverse Illumina reads and joins them using the chosen method. It can optionally create an updated index reads file containing the index reads for the surviving joined paired-end reads. If you choose to write an updated index file, make sure that the order and header format of the index reads match the order and header format of the reads in the files that will be joined (this is the default for reads generated on Illumina instruments).
Two methods are currently available for joining paired-end data:
- fastq-join - Erik Aronesty, 2011. ea-utils : “Command-line tools for processing biological sequencing data” (http://code.google.com/p/ea-utils)
- SeqPrep - (https://github.com/jstjohn/SeqPrep)
Usage: join_paired_ends.py [options]
Input Arguments:
[REQUIRED]
- -f, --forward_reads_fp
- Path to input forward reads in FASTQ format.
- -r, --reverse_reads_fp
- Path to input reverse reads in FASTQ format.
- -o, --output_dir
- Directory to store result files.
[OPTIONAL]
- -m, --pe_join_method
- Method to use for joining paired-ends. Valid choices are: fastq-join, SeqPrep [default: fastq-join]
- -b, --index_reads_fp
- Path to the barcode / index reads in FASTQ format. Will be filtered based on surviving joined pairs.
- -j, --min_overlap
- Applies to both fastq-join and SeqPrep methods. Minimum overlap, in base pairs, required to join pairs. Must be an integer. If not set, program defaults will be used. [default: None]
- -p, --perc_max_diff
- Only applies to fastq-join method, otherwise ignored. Maximum allowed % differences within the region of overlap. Must be an integer between 1 and 100. If not set, program defaults will be used. [default: None]
- -y, --max_ascii_score
- Only applies to SeqPrep method, otherwise ignored. Maximum quality score / ASCII code allowed to appear within the joined pairs output. For more information, please see: http://en.wikipedia.org/wiki/FASTQ_format. [default: J]
- -n, --min_frac_match
- Only applies to SeqPrep method, otherwise ignored. Minimum fraction of matching bases required to join reads. Must be a float between 0 and 1. If not set, program defaults will be used. [default: None]
- -g, --max_good_mismatch
- Only applies to SeqPrep method, otherwise ignored. Maximum mismatched high-quality bases allowed to join reads. Must be a float between 0 and 1. If not set, program defaults will be used. [default: None]
- -6, --phred_64
- Only applies to SeqPrep method, otherwise ignored. Set if input reads are in phred+64 format. Output will always be phred+33. [default: False]
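The phred+64 to phred+33 re-encoding mentioned for the -6 option can be illustrated with a small sketch. This is not the code SeqPrep or join_paired_ends.py uses; the function name phred64_to_phred33 is hypothetical, and the sketch assumes Illumina-style phred+64 input with non-negative scores.

```python
def phred64_to_phred33(quality: str) -> str:
    """Re-encode a phred+64 quality string as phred+33 (illustrative only)."""
    converted = []
    for ch in quality:
        # Each character stores (score + offset); recover the raw score first.
        score = ord(ch) - 64
        if score < 0:
            # Assumption: scores below 0 (older Solexa encodings) are rejected.
            raise ValueError(f"{ch!r} is not a valid phred+64 character")
        # Re-apply the score with the phred+33 offset.
        converted.append(chr(score + 33))
    return "".join(converted)


def phred33_score(ch: str) -> int:
    """Return the quality score encoded by one phred+33 character."""
    return ord(ch) - 33
```

Under phred+33, the default -y value of J corresponds to a score of 41, since the ASCII code of J is 74.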
Output:
All paired-end joining software will return a joined / merged / assembled paired-end fastq file. Depending on the method chosen, additional files may be written to the user-specified output directory.
- fastq-join will output fastq-formatted files as:
- “*.join”: assembled / joined reads output
- “*.un1”: unassembled / unjoined reads1 output
- “*.un2”: unassembled / unjoined reads2 output
- SeqPrep will output fastq-formatted gzipped files as:
- “*_assembled.gz”: assembled / joined reads output
- “*_unassembled_R1.gz”: unassembled / unjoined reads1 output
- “*_unassembled_R2.gz”: unassembled / unjoined reads2 output
- If a barcode / index file is provided via the ‘-b’ option, an updated barcodes file will be output as:
- “..._barcodes.fastq”: This barcode / index file must be used in conjunction with the joined paired-ends file as input to split_libraries_fastq.py. Except for missing reads that may result from failed merging of paired-ends, the index-reads and joined-reads must be in the same order.
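To show why the index reads and joined reads must share the same order, here is a minimal sketch of filtering index reads down to the surviving joined reads. It is not the script's actual implementation. The helper names (parse_fastq, read_id, filter_index_reads) are hypothetical, and it assumes the read identifier is the header text up to the first whitespace.

```python
from typing import Iterable, Iterator, List, Tuple

FastqRecord = Tuple[str, str, str]  # (header without '@', sequence, quality)


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming four lines per record."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(lines), 4):
        header, sequence, _plus, quality = lines[i:i + 4]
        yield header[1:], sequence, quality


def read_id(header: str) -> str:
    """Assumption: the read ID is the header up to the first whitespace."""
    return header.split()[0]


def filter_index_reads(index_records: Iterable[FastqRecord],
                       joined_records: Iterable[FastqRecord]) -> List[FastqRecord]:
    """Keep index reads whose ID matches a joined read, walking both in order."""
    joined_iter = iter(joined_records)
    current = next(joined_iter, None)
    kept = []
    for record in index_records:
        if current is None:
            break  # every joined read has been matched
        if read_id(record[0]) == read_id(current[0]):
            kept.append(record)
            current = next(joined_iter, None)
    if current is not None:
        # Only happens when the two files are not in the same order,
        # or the joined read has no index read at all.
        raise ValueError(f"no index read found for {read_id(current[0])}")
    return kept
```

Because the sketch makes a single pass over both inputs, a joined read that appears out of order relative to the index reads is reported as an error rather than silently matched.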
Join paired-ends with ‘fastq-join’:
This is the default method to join paired-end Illumina data:
join_paired_ends.py -f $PWD/forward_reads.fastq -r $PWD/reverse_reads.fastq -o $PWD/fastq-join_joined
Join paired-ends with ‘SeqPrep’:
Produces output similar to that of ‘fastq-join’, but returns data in gzipped format.
join_paired_ends.py -m SeqPrep -f $PWD/forward_reads.fastq -r $PWD/reverse_reads.fastq -o $PWD/SeqPrep_joined
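Since SeqPrep output is gzipped, downstream code has to decompress it before reading. The sketch below uses Python's standard gzip module to iterate over records. The function name iter_gzipped_fastq is hypothetical, and the sketch assumes four lines per record with no blank lines inside the file.

```python
import gzip
from typing import Iterator, Tuple


def iter_gzipped_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from a gzipped FASTQ file."""
    # "rt" opens the compressed file in text mode, so lines arrive as str.
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file (assumes no blank lines between records)
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # skip the '+' separator line
            quality = handle.readline().rstrip("\n")
            yield header[1:], sequence, quality
```

For example, the path passed in could be one of the “*_assembled.gz” files written to the output directory.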
Update the index / barcode reads file to match the surviving joined pairs:
This is required if you will be using split_libraries_fastq.py.
join_paired_ends.py -f $PWD/forward_reads.fastq -r $PWD/reverse_reads.fastq -b $PWD/barcodes.fastq -o $PWD/fastq-join_joined
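When the invocations above need to be scripted, the argument list can be assembled programmatically. The following is a minimal sketch that uses only the flags documented on this page. The helper names build_join_command and run_join are hypothetical, and the sketch assumes join_paired_ends.py is available on the PATH.

```python
import subprocess
from typing import List, Optional

VALID_METHODS = ("fastq-join", "SeqPrep")


def build_join_command(forward: str, reverse: str, output_dir: str,
                       method: Optional[str] = None,
                       index_reads: Optional[str] = None,
                       min_overlap: Optional[int] = None) -> List[str]:
    """Assemble an argument list for join_paired_ends.py."""
    cmd = ["join_paired_ends.py", "-f", forward, "-r", reverse, "-o", output_dir]
    if method is not None:
        if method not in VALID_METHODS:
            raise ValueError(f"method must be one of {VALID_METHODS}")
        cmd += ["-m", method]
    if index_reads is not None:
        cmd += ["-b", index_reads]  # index reads are filtered to surviving pairs
    if min_overlap is not None:
        cmd += ["-j", str(int(min_overlap))]  # the option requires an integer
    return cmd


def run_join(cmd: List[str]) -> None:
    """Run the command, raising CalledProcessError on a non-zero exit status."""
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.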