
filter_fasta.py – This script can be applied to remove sequences from a fasta or fastq file based on input criteria.

Description:

Usage: filter_fasta.py [options]

Input Arguments:

[REQUIRED]

-f, --input_fasta_fp
Path to the input fasta file
-o, --output_fasta_fp
The output fasta filepath

[OPTIONAL]

-m, --otu_map
An OTU map whose sequence ids should be retained
-s, --seq_id_fp
A list of sequence identifiers (or tab-delimited lines with a seq identifier in the first field) which should be retained
-b, --biom_fp
A biom file whose OTU identifiers should be retained
-a, --subject_fasta_fp
A fasta file whose seq ids should be retained
-p, --seq_id_prefix
Keep seqs where seq_id starts with this prefix
--sample_id_fp
Keep seqs where seq_id starts with a sample id listed in this file
-n, --negate
Discard the passed seq ids rather than keeping them [default: False]
--mapping_fp
Mapping file path (for use with --valid_states) [default: None]
--valid_states
Description of sample ids to retain (for use with --mapping_fp) [default: None]

Output:

OTU map-based filtering:

Keep all sequences that show up in an OTU map.

filter_fasta.py -f inseqs.fasta -o otu_map_filtered_seqs.fasta -m otu_map.txt
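To make "show up in an OTU map" concrete, here is a minimal sketch of collecting the sequence ids from such a map. It assumes each line holds an OTU id followed by tab-separated sequence ids; the function name seq_ids_from_otu_map and the sample data are hypothetical and are not taken from the script itself.

```python
def seq_ids_from_otu_map(lines):
    """Collect every sequence id listed in an OTU map.

    Assumption: each line is an OTU id followed by tab-separated
    sequence ids belonging to that OTU.
    """
    seq_ids = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # fields[0] is the OTU id; the remaining fields are sequence ids.
        seq_ids.update(field for field in fields[1:] if field)
    return seq_ids


# Hypothetical two-line OTU map.
otu_map = ["OTU0\ts1_1\ts1_2\n", "OTU1\ts2_7\n"]
ids_to_keep = seq_ids_from_otu_map(otu_map)
```

Any sequence in the input fasta file whose id is in ids_to_keep would then be written to the output.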

Chimeric sequence filtering:

Discard all sequences that show up in chimera checking output. NOTE: Passing -n is essential here. It negates the request, so the sequences listed via -s are discarded rather than kept, which is what removes the identified chimeras from inseqs.fasta.

filter_fasta.py -f inseqs.fasta -o non_chimeric_seqs.fasta -s chimeric_seqs.txt -n
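The effect of -n can be illustrated with a short sketch. This is not the script's own code: filter_records and the sample records are hypothetical, and the sketch compares whole ids, which may differ from how the script matches fasta header lines.

```python
def filter_records(records, listed_ids, negate=False):
    """Yield (seq_id, seq) pairs, keeping or discarding the listed ids.

    negate=False keeps only ids in listed_ids (the default behavior).
    negate=True discards ids in listed_ids and keeps everything else.
    """
    for seq_id, seq in records:
        listed = seq_id in listed_ids
        # A record passes when its "listed" status differs from negate.
        if listed != negate:
            yield seq_id, seq


# Hypothetical input: three sequences, one flagged as chimeric.
records = [("s1_1", "ACGT"), ("s1_2", "GGCC"), ("s2_7", "TTAA")]
chimeras = {"s1_2"}

without_n = list(filter_records(records, chimeras))
with_n = list(filter_records(records, chimeras, negate=True))
```

Without negation, only the chimeric sequence would be written out, which is the opposite of what chimera removal needs.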

Sequence list filtering:

Keep all sequences from a fasta file that are listed in a text file.

filter_fasta.py -f inseqs.fasta -o list_filtered_seqs.fasta -s seqs_to_keep.txt
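The -s file may hold either bare identifiers or tab-delimited lines whose first field is the identifier. A minimal sketch of reading such a file follows; read_seq_ids is a hypothetical helper, and skipping blank lines is an assumption of this sketch rather than documented behavior.

```python
def read_seq_ids(lines):
    """Return the set of sequence ids found in the first tab-delimited field."""
    seq_ids = set()
    for line in lines:
        line = line.strip()
        if not line:
            # Assumption: blank lines carry no identifier and are skipped.
            continue
        seq_ids.add(line.split("\t")[0])
    return seq_ids


# Hypothetical contents of seqs_to_keep.txt: one bare id, one tabbed line.
lines = ["s1_1\n", "s2_7\tsome extra annotation\n", "\n"]
ids = read_seq_ids(lines)
```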

biom-based filtering:

Keep all sequences that are listed as observations in a biom file.

filter_fasta.py -f inseqs.fastq -o biom_filtered_seqs.fastq -b otu_table.biom

fastq filtering:

Keep all sequences from a fastq file that are listed in a text file (note: file name must end with .fastq to support fastq filtering).

filter_fasta.py -f inseqs.fastq -o list_filtered_seqs.fastq -s seqs_to_keep.txt

sample id list filtering:

Keep all sequences from a fasta file where the sample id portion of the sequence identifier is listed in a text file (sequence identifiers in fasta file must be in post-split libraries format: sampleID_seqID).

filter_fasta.py -f sl_inseqs.fasta -o sample_id_list_filtered_seqs.fasta --sample_id_fp map.txt
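As an illustration of the sampleID_seqID convention, the sketch below extracts the sample id portion of each sequence identifier and keeps records whose sample id is listed. The helper names are hypothetical, and splitting at the last underscore is a choice made for this sketch; the script itself is described only as keeping seqs whose id starts with a listed sample id.

```python
def sample_id_of(seq_id):
    """Return the sample id portion of a sampleID_seqID identifier.

    Assumption: the part after the last underscore is the per-sequence
    suffix, and everything before it is the sample id.
    """
    return seq_id.rsplit("_", 1)[0]


def keep_by_sample(records, sample_ids):
    """Keep (seq_id, seq) pairs whose sample id is in sample_ids."""
    return [(seq_id, seq) for seq_id, seq in records
            if sample_id_of(seq_id) in sample_ids]


# Hypothetical records and sample id list.
records = [("PC.354_1", "ACGT"), ("PC.355_2", "GGCC"), ("PC.354_3", "TTAA")]
kept = keep_by_sample(records, {"PC.354"})
```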
