sampledoc
News and Announcements »

exclude_seqs_by_blast.py – Exclude contaminated sequences using BLAST

Description:

This code is designed to allow users of the QIIME workflow to conveniently exclude unwanted sequences from their data. This is mostly useful for excluding human sequences from runs to comply with Internal Review Board (IRB) requirements, but may also have other uses (e.g. perhaps excluding a major bacterial contaminant). Sequences from a run are searched against a user-specified subject database, where BLAST hits are screened by e-value and the percentage of the query that aligns to the sequence.

For human screening THINK CAREFULLY about the data set that you screen against. Are you excluding human non-coding sequences? What about mitochondrial sequences? This point is CRITICAL because submitting human sequences that are not IRB-approved is BAD.

(e.g. you would NOT want to just screen against just the coding sequences of the human genome as found in the KEGG .nuc files, for example)

One valid approach is to screen all putative 16S rRNA sequences against greengenes to ensure they are bacterial rather than human.

WARNING: You cannot use this script if there are spaces in the path to the database of fasta files because formatdb cannot handle these paths (this is a limitation of NCBI’s tools and we have no control over it).

Usage: exclude_seqs_by_blast.py [options]

Input Arguments:

Note

[REQUIRED]

-i, --querydb
The path to a FASTA file containing query sequences
-d, --subjectdb
The path to a FASTA file to BLAST against
-o, --outputdir
The output directory

[OPTIONAL]

-e, --e_value
The e-value cutoff for blast queries [default: 1e-10].
-p, --percent_aligned
The % alignment cutoff for blast queries, expressed as a fraction between 0 and 1 [default: 0.97].
--no_clean
If set, don’t delete files generated by formatdb after running [default: False].
--blastmatroot
Path to a folder containing blast matrices [default: None].
--working_dir
Working dir for BLAST [default: /tmp/].
-m, --max_hits
Max hits parameter for BLAST. CAUTION: Because filtering on alignment percentage occurs after BLAST, a max hits value of 1 in combination with an alignment percent filter could miss valid contaminants. [default: 100]
-w, --word_size
Word size to use for BLAST search [default: 28]
-n, --no_format_db
If this flag is specified, format_db will not be called on the subject database (formatdb will be set to False). This is useful if you have already formatted the database and a) it took a very long time or b) you want to run the script in parallel on the pre-formatted database [default: False]

Output:

Four output files are generated based on the supplied outputpath + unique suffixes:

  1. “filename_prefix”.matching: A FASTA file of sequences that did pass the screen (i.e. matched the database and passed all filters).
  2. “filename_prefix”.non-matching: A FASTA file of sequences that did not pass the screen.
  3. “filename_prefix”.raw_blast_results: Contains the raw BLAST results from the screening.
  4. “filename_prefix”.sequence_exclusion_log: A log file summarizing the options used and results obtained.

In addition, if the --no_clean option is passed, the files generated by formatdb will be kept in the same directory as subjectdb.

Examples:

The following is a simple example, where the user can take a given FASTA file (i.e. resulting FASTA file from pick_rep_set.py) and blast those sequences against a reference FASTA file containing the set of sequences which are considered contaminated:

exclude_seqs_by_blast.py -i repr_set_seqs.fasta -d ref_seq_set.fna -o exclude_seqs/

Alternatively, if the user would like to change the percent of aligned sequence coverage (“-p”) or the maximum E-value (“-e”), they can use the following command:

exclude_seqs_by_blast.py -i repr_set_seqs.fasta -d ref_seq_set.fna -o exclude_seqs/ -p 0.95 -e 1e-10

sampledoc