
Input Files

General suggestions for formatting your files

These are general guidelines that apply to formatting files for use with QIIME, and command line tools in general:

  1. Files should have proper file type suffix. For example, .fna or .fasta for FASTA files; .qual for quality score files; .sff for sff files; .txt for mapping files.
  2. Do not use spaces in your filenames. Use underscores or MixedCase instead. For example, instead of amazon soil.fna use amazon_soil.fna or AmazonSoil.fna.
  3. Edit your files with a text editor such as TextEdit or TextMate (on Mac), gedit (on Linux), vim, or emacs, but not Microsoft Word, which is a word processor, not a text editor. Mapping files and OTU tables can be edited in Microsoft Excel, but should always be saved as tab-delimited text.

Metadata mapping files

Metadata mapping files are used throughout QIIME, and provide per-sample metadata. These are used in split_libraries.py, beta_diversity_through_plots.py, alpha_rarefaction.py and other scripts.

Mapping File Overview

The mapping file is generated by the user. This file contains all of the information about the samples necessary to perform the data analysis. In general, the mapping file should contain the name of each sample, the barcode sequence used for each sample, the linker/primer sequence used to amplify the sample, and a description column. One should also include in the mapping file any metadata that relates to the samples (for instance, health status or sampling site) and any additional information relating to specific samples that may be useful to have at hand when considering outliers (for example, what medications a patient was taking at time of sampling).

The mapping file relates barcodes in the FASTA file to each sample and their related metadata. Each FASTA file must have at least one mapping file but multiple mapping files can be defined for any given FASTA file. For example, if you have bundled several unrelated studies into one 454 run (for instance, a mouse study, a soil study and a fish study), and need to analyze each study separately, you would generate three separate mapping files that specify a subset of samples and their associated metadata. Alternatively, you can combine multiple runs (e.g. multiple 454 runs, multiple FASTA files) with a single mapping file.

Each column header MUST contain alphanumeric (a-z, A-Z and 0-9) and/or underscore (“_”) characters only, and the header MUST start with a letter. All other characters (e.g. $, *, ^, etc) are not supported at this time and use of those characters may cause problems downstream in the QIIME pipeline.
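The header rule above can be checked with a regular expression. The following is a minimal sketch, not part of QIIME: the function name is hypothetical, digits 0-9 are all assumed to count as alphanumeric, and the leading “#” of “#SampleID” is assumed to be a file marker rather than part of the header name.

```python
import re

# One letter first, then any mix of letters, digits, or underscores.
HEADER_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_header(header, is_first_column=False):
    """Return True if a column header follows the rule described above."""
    # Assumption: the '#' that starts the header line is not part of the name.
    if is_first_column and header.startswith("#"):
        header = header[1:]
    return HEADER_PATTERN.fullmatch(header) is not None


for name in ["Treatment", "DOB_2", "2ndVisit", "Cost$"]:
    print(name, is_valid_header(name))
```

check_id_map.py remains the authoritative check; a helper like this is only useful for catching problems early when a mapping file is generated by another program.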

Currently, users can define their own column headers; however, QIIME will be adopting the MIMARKS standard, and therefore all column headings MUST correspond to the proper MIMARKS nomenclature (http://gensc.org/gc_wiki/index.php/MIMARKS). The following details the current mapping file guidelines:

  1. The first column header must be “#SampleID”, and the data in this column must contain unique (short and meaningful) sample identifiers containing only alphanumeric and period (“.”) characters. Leading and trailing spaces will raise a warning when using check_id_map.py.
  2. The second column header must be “BarcodeSequence”, where each value in that column corresponds to the barcode used for each sample. Only IUPAC DNA characters are acceptable. Leading and trailing spaces will raise a warning when using check_id_map.py.
  3. The third column header must be “LinkerPrimerSequence”, where each value in that column corresponds to the primer used to amplify that sample. Only IUPAC DNA characters are acceptable. Leading and trailing spaces will raise a warning when using check_id_map.py.
  4. All subsequent column headers (except the last one) are metadata headers. For example, a “Smoker” column would include either “Yes” or “No”. Note that the data in each column is assumed to be categorical unless specified otherwise. Categorical data columns must include at least 2 unique values per column. All metadata must be composed of only alphanumeric, underscore (“_”), period (“.”), minus sign (“-”), plus sign (“+”), percentage (“%”), space (“ ”), semicolon (“;”), colon (“:”), comma (“,”), and/or forward slash (“/”) characters. For missing data, write “NA”; do not leave blanks.
  5. The last column of the mapping file must be named “Description”. This column holds information that is unique to each sample, such as the medications taken by the patient, or any other descriptive information. The same character restrictions that apply to the metadata columns in guideline four apply to sample descriptions. Sample and run descriptions should be kept brief, if possible. Information that applies to all samples in a mapping file should go in the run description section, which is defined as lines starting with a “#” character, immediately following the header line (see the example format below). Information that is specific to a particular sample should go in the “Description” column.
  6. There should be no empty lines or comment lines (starting with #) throughout the metadata, with the exception of any additional run description lines that immediately follow the initial header line.
  7. Double quotes (") will be stripped from the mapping file (header and data fields) when it is parsed by most scripts in QIIME. For check_id_map.py, these will be flagged with a warning.
  8. Stripping of leading and trailing whitespace is only performed on table cells (including sample IDs), not on the column headers. If double quote characters (") are present, these are removed first, followed by whitespace stripping.
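To make guidelines 5 through 8 concrete, here is a minimal sketch of how a tab-delimited mapping file could be read. It is not QIIME's own parser; the function name is hypothetical, and raising an error for a comment line that appears after data rows is an assumption about how to handle guideline 6.

```python
def parse_mapping_lines(lines):
    """Split mapping file lines into (headers, run_description, rows)."""
    headers, comments, rows = None, [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # Blank lines should not occur (guideline 6); skipped here.
        if headers is None:
            # Headers: quotes removed, whitespace left untouched (guideline 8).
            headers = [field.replace('"', "") for field in line.split("\t")]
        elif line.startswith("#"):
            if rows:
                raise ValueError("comment line found after data rows")
            comments.append(line[1:].strip())  # Run description line.
        else:
            # Data cells: quotes removed first, then whitespace stripped.
            rows.append([cell.replace('"', "").strip() for cell in line.split("\t")])
    return headers, comments, rows


example = [
    "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tDescription\n",
    "#Example run description\n",
    'PC.354\tAGCACGAGCCTA\tYATGCTGCCTCCCGTAGGAGT\t" Control_mouse "\n',
]
headers, comments, rows = parse_mapping_lines(example)
print(headers)
print(comments)
print(rows)
```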

The header for this mapping file starts with a pound (#) character, and generally requires a “SampleID”, “BarcodeSequence”, “LinkerPrimerSequence”, and a “Description”, all tab separated. The following example header represents the minimum field requirement for the mapping file:

Note

  • #SampleID BarcodeSequence LinkerPrimerSequence Description

Additional optional headers can follow the “LinkerPrimerSequence” header. Any lines following this header that start with the pound character are considered to be comment lines and are ignored by QIIME.

Data fields do not start with a pound character. These fields are tab separated, and have restrictions regarding character usage. SampleID fields only accept alphanumeric and period (.) characters. The other data fields will accept alphanumeric, period (.), underscore (_), percent (%), plus (+), minus (-), space ( ), semicolon (;), colon (:), comma (,), or forward slash (/) characters.
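The character restrictions on data fields can be expressed as two regular expressions. This is a sketch under stated assumptions, not the check that check_id_map.py performs: the names are hypothetical, the stricter IUPAC rule for barcode and primer columns is not applied, and empty cells are reported as invalid (which would need relaxing for mapping files without barcodes or primers).

```python
import re

# SampleID: alphanumeric and period only.
SAMPLE_ID_PATTERN = re.compile(r"[A-Za-z0-9.]+")
# Other fields: alphanumeric plus . _ % + - space ; : , /
DATA_FIELD_PATTERN = re.compile(r"[A-Za-z0-9._%+\- ;:,/]+")


def invalid_cells(row):
    """Return the indices of cells in one data row that break the rules."""
    bad = []
    for index, cell in enumerate(row):
        pattern = SAMPLE_ID_PATTERN if index == 0 else DATA_FIELD_PATTERN
        if pattern.fullmatch(cell) is None:
            bad.append(index)
    return bad


row = ["PC&&&&", "AGCACGAGCCTA", "YATGCTGCCTCCCGTAGGAGT", "Fast^2", "20071112"]
print(invalid_cells(row))
```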

You are highly encouraged to validate your mapping file using check_id_map.py before attempting to analyze your data. This tool will check for errors, and make suggestions for other aspects of the file to be edited (errors and warnings are output to a log file, and suggested changes to invalid characters are output to a “_corrected.txt” file). The contents of a sample mapping file are shown here - as you can see, a nucleotide barcode sequence is provided for each of the 9 samples, as well as metadata related to treatment group and date of birth, and general run descriptions about the project:

Note

  • #SampleID BarcodeSequence LinkerPrimerSequence Treatment DOB Description
  • #Example mapping file for the QIIME analysis package. These 9 samples are from a study of the effects of
  • #exercise and diet on mouse cardiac physiology (Crawford, et al, PNAS, 2009).
  • PC.354 AGCACGAGCCTA YATGCTGCCTCCCGTAGGAGT Control 20061218 Control_mouse__I.D._354
  • PC.355 AACTCGTCGATG YATGCTGCCTCCCGTAGGAGT Control 20061218 Control_mouse__I.D._355
  • PC.356 ACAGACCACTCA YATGCTGCCTCCCGTAGGAGT Control 20061126 Control_mouse__I.D._356
  • PC.481 ACCAGCGACTAG YATGCTGCCTCCCGTAGGAGT Control 20070314 Control_mouse__I.D._481
  • PC.593 AGCAGCACTTGT YATGCTGCCTCCCGTAGGAGT Control 20071210 Control_mouse__I.D._593
  • PC.607 AACTGTGCGTAC YATGCTGCCTCCCGTAGGAGT Fast 20071112 Fasting_mouse__I.D._607
  • PC.634 ACAGAGTCGGCT YATGCTGCCTCCCGTAGGAGT Fast 20080116 Fasting_mouse__I.D._634
  • PC.635 ACCGCAGAGTCA YATGCTGCCTCCCGTAGGAGT Fast 20080116 Fasting_mouse__I.D._635
  • PC.636 ACGGTGAGTGTC YATGCTGCCTCCCGTAGGAGT Fast 20080116 Fasting_mouse__I.D._636

This example mapping file is available here: Example Mapping File (Right click and use ‘download’ or ‘save as’ to save this file)

During demultiplexing with split_libraries.py, the SampleID that is associated with the barcode found in a given sequence is used to label the output sequence. An example set of such assignments is shown in the Tutorial - Assign Samples to Multiplex Reads section. Note that in this example, the barcode associated with “PC.634”, “ACAGAGTCGGCT”, was found in the first two sequences, and so the output “seqs.fna” file has these sequences labeled as “PC.634_1” and “PC.634_2” respectively. The third sequence contained the barcode “AGCACGAGCCTA”, and hence was associated with “PC.354”.

Generating a Mapping File by Hand

The easiest way to generate a mapping file is to use a spreadsheet program, such as Microsoft Excel. Each header and field should be in its own column. When saving the file, it is best to use the pre-built tab-delimited option. If this is not available for a particular spreadsheet program, set the format to text csv, the field delimiter as a tab, and leave the text delimiter blank. Once the file is saved, open it in a basic text editor to see if the formatting meets the criteria given above. Finally, use check_id_map.py to test the file for QIIME compatibility.
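If the mapping file is produced by a script rather than a spreadsheet, the same settings (tab as the field delimiter, no text delimiter) can be reproduced with Python's standard csv module. This is a minimal sketch; the function name is hypothetical and the rows are taken from the example mapping file above.

```python
import csv
import io


def write_mapping(rows, handle):
    """Write rows as tab-delimited text without any quoting."""
    writer = csv.writer(
        handle,
        delimiter="\t",          # Field delimiter: tab.
        quoting=csv.QUOTE_NONE,  # No text delimiter around fields.
        lineterminator="\n",
    )
    writer.writerows(rows)


rows = [
    ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence", "Description"],
    ["PC.354", "AGCACGAGCCTA", "YATGCTGCCTCCCGTAGGAGT", "Control_mouse__I.D._354"],
]
buffer = io.StringIO()  # Stands in for an open file handle.
write_mapping(rows, buffer)
print(buffer.getvalue())
```

With csv.QUOTE_NONE and no escape character set, the writer raises an error if a field itself contains a tab or a quote character, which is a reasonable failure mode here because neither is allowed in a mapping file.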

Fixing Problems in the Mapping File

check_id_map.py will test for many problems in the mapping file, such as incorrect character usage. A “_corrected.txt” form of the mapping file will be generated containing invalid characters replaced by allowed characters. The following is an example of an incorrectly formatted mapping file, with invalid characters, duplicated values that should be unique (“SampleID”, “BarcodeSequence”), non DNA characters in the “LinkerPrimerSequence”, and a missing “Description” cell.

Note

  • #SampleID BarcodeSequence LinkerPrimerSequence Treatment DOB Description
  • #Example mapping file for the QIIME analysis package. These 9 samples are from a study of the effects of
  • #exercise and diet on mouse cardiac physiology (Crawford, et al, PNAS, 2009).
  • PC&&&& AGCACGAGCCTA YATGCTGCCTCCCGTAGGAGT Control 20061218 Control_mouse__I.D._354
  • PC.355 AGCACGAGCCTA YATGCTGCCTCCCGTAGGAGT Control 20061218 Control_mouse__I.D._355
  • PC.355 ACAGACCACTCA YATGCTGCCTCCCGTAGGAGT Control 20061126 Control_mouse__I.D._356
  • PC_481 ACCAGCGACTAG ZATGCTGCCTCCCGTAGGAGT Control 20070314 Control_mouse__I.D._481
  • PC.593 AGCAGCACTTGT YATGCTGCCTCCCGTAGGAGT Control 20071210 Control_mouse__I.D._593
  • PC.607 AACTGTGCGTAC YATGCTGCCTCCCGTAGGAGT Fast^2 20071112 Fasting_mouse__I.D._607
  • PC.634 ACAGAGTCGGCT YATGCTGCCTCCCGTAGGAGT Fast 20080116
  • PC.635 ACCGCAGAGTCA YATGCTGCCTCCCGTAGGAGT Fast 20080116 Fasting_mouse__I.D._635
  • PC.636 ACGGTGAGTGTC YATGCTGCCTCCCGTAGGAGT Fast 20080116 Fasting_mouse__I.D._636

The corrected mapping file will replace invalid characters and fill in missing “Description” fields. The example corrected mapping file output is below:

Note

  • #SampleID BarcodeSequence LinkerPrimerSequence Treatment DOB Description
  • #Example mapping file for the QIIME analysis package. These 9 samples are from a study of the effects of
  • #exercise and diet on mouse cardiac physiology (Crawford, et al, PNAS, 2009).
  • PC.... AGCACGAGCCTA YATGCTGCCTCCCGTAGGAGT Control 20061218 Control_mouse__I.D._354
  • PC.355 AGCACGAGCCTA YATGCTGCCTCCCGTAGGAGT Control 20061218 Control_mouse__I.D._355
  • PC.355 ACAGACCACTCA YATGCTGCCTCCCGTAGGAGT Control 20061126 Control_mouse__I.D._356
  • PC.481 ACCAGCGACTAG ZATGCTGCCTCCCGTAGGAGT Control 20070314 Control_mouse__I.D._481
  • PC.593 AGCAGCACTTGT YATGCTGCCTCCCGTAGGAGT Control 20071210 Control_mouse__I.D._593
  • PC.607 AACTGTGCGTAC YATGCTGCCTCCCGTAGGAGT Fast_2 20071112 Fasting_mouse__I.D._607
  • PC.634 ACAGAGTCGGCT YATGCTGCCTCCCGTAGGAGT Fast 20080116 missing_description
  • PC.635 ACCGCAGAGTCA YATGCTGCCTCCCGTAGGAGT Fast 20080116 Fasting_mouse__I.D._635
  • PC.636 ACGGTGAGTGTC YATGCTGCCTCCCGTAGGAGT Fast 20080116 Fasting_mouse__I.D._636

However, this corrected mapping file is still not usable. The log file generated by check_id_map.py explains the remaining problems. The barcode “AGCACGAGCCTA” is duplicated, and appears in the first two rows. Rows two and three contain the same “SampleID” value. In addition, the “Z” character in the fourth row “LinkerPrimerSequence” is not a valid IUPAC DNA character and needs to be replaced with a legitimate nucleotide code. All of these errors will have to be fixed by hand.
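The three kinds of problem that need manual fixes can also be spotted with a short script before running check_id_map.py. The sketch below is an illustration only: the function name and message wording are hypothetical, rows are assumed to be lists ordered as SampleID, BarcodeSequence, LinkerPrimerSequence, and the IUPAC set used is the 15 standard DNA codes.

```python
from collections import Counter

# Standard IUPAC DNA codes (bases plus ambiguity codes).
IUPAC_DNA = set("ACGTRYMKSWHBVDN")


def find_manual_fixes(rows):
    """List duplicate IDs or barcodes and non-IUPAC primer characters."""
    problems = []
    for column, name in ((0, "SampleID"), (1, "BarcodeSequence")):
        counts = Counter(row[column] for row in rows)
        for value, count in counts.items():
            if count > 1:
                problems.append(f"duplicate {name}: {value}")
    for row in rows:
        bad = sorted(set(row[2].upper()) - IUPAC_DNA)
        if bad:
            problems.append(f"non-IUPAC primer characters for {row[0]}: {''.join(bad)}")
    return problems


rows = [
    ["PC....", "AGCACGAGCCTA", "YATGCTGCCTCCCGTAGGAGT"],
    ["PC.355", "AGCACGAGCCTA", "YATGCTGCCTCCCGTAGGAGT"],
    ["PC.355", "ACAGACCACTCA", "YATGCTGCCTCCCGTAGGAGT"],
    ["PC.481", "ACCAGCGACTAG", "ZATGCTGCCTCCCGTAGGAGT"],
]
for problem in find_manual_fixes(rows):
    print(problem)
```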

Mapping Files Without Barcodes and/or Primers

In some circumstances, users may need to generate a mapping file that does not contain barcodes and/or primers. To generate such a mapping file, fields for “BarcodeSequence” and “LinkerPrimerSequence” can be left empty. An example of such a file is below (note that the tabs are still present for the empty “BarcodeSequence” and “LinkerPrimerSequence” fields):

Note

  • #SampleID BarcodeSequence LinkerPrimerSequence Treatment DOB Description
  • #Example mapping file for the QIIME analysis package. These 9 samples are from a study of the effects of
  • #exercise and diet on mouse cardiac physiology (Crawford, et al, PNAS, 2009).
  • PC.354 Control 20061218 Control_mouse__I.D._354
  • PC.355 Control 20061218 Control_mouse__I.D._355
  • PC.356 Control 20061126 Control_mouse__I.D._356
  • PC.481 Control 20070314 Control_mouse__I.D._481
  • PC.593 Control 20071210 Control_mouse__I.D._593
  • PC.607 Fast 20071112 Fasting_mouse__I.D._607
  • PC.634 Fast 20080116 Fasting_mouse__I.D._634
  • PC.635 Fast 20080116 Fasting_mouse__I.D._635
  • PC.636 Fast 20080116 Fasting_mouse__I.D._636

To validate such a mapping file, the user will need to disable barcode and primer testing with the -p and -b parameters:

check_id_map.py -m <mapping_filepath> -o check_id_output/ -p -b

The above mapping file will still show a warning: because it lacks barcodes, it has no way to differentiate sequences, and thus cannot be used for demultiplexing. However, such warnings can be ignored if the mapping file is being used for steps downstream of demultiplexing.

Demultiplexed sequences

Post-split_libraries FASTA File Overview

When performing a typical workflow, it is not necessary for users to put together the specially formatted post-split-libraries FASTA file. Thus, this section is primarily useful for users who would like to use the downstream capabilities of QIIME without running split_libraries.py. For a description of the essential files for the typical workflow see their description in the QIIME Tutorial.

The purpose of the post-split_libraries FASTA is to relate each sequence to the sample from which it came, while also recording information about the original and error-corrected barcodes from which this inference was made.

Here is an example of the post-split libraries FASTA file format:

Note

  • >PC.634_1 FLP3FBN01ELBSX orig_bc=ACAGAGTCGGCT new_bc=ACAGAGTCGGCT bc_diffs=0
  • CTGGGCCGTGTCTCAGTCCCAATGTGGCCGTTTACCCTCTCAGGCCGGCTACGCATCATCGCCTTGGTGGGCCGTT
  • >PC.634_2 FLP3FBN01EG8AX orig_bc=ACAGAGTCGGCT new_bc=ACAGAGTCGGCT bc_diffs=0
  • TTGGACCGTGTCTCAGTTCCAATGTGGGGGCCTTCCTCTCAGAACCCCTATCCATCGAAGGCTTGGTGGGCCGTTA
  • >PC.354_3 FLP3FBN01EEWKD orig_bc=AGCACGAGCCTA new_bc=AGCACGAGCCTA bc_diffs=0
  • TTGGGCCGTGTCTCAGTCCCAATGTGGCCGATCAGTCTCTTAACTCGGCTATGCATCATTGCCTTGGTAAGCCGTT
  • >PC.481_4 FLP3FBN01DEHK3 orig_bc=ACCAGCGACTAG new_bc=ACCAGCGACTAG bc_diffs=0
  • CTGGGCCGTGTCTCAGTCCCAATGTGGCCGTTCAACCTCTCAGTCCGGCTACTGATCGTCGACTTGGTGAGCCGTT

An example of the post-split_libraries FASTA file is available here: Example Post Split Libraries Sequence File

(Right click and use ‘download’ or ‘save as’ to save this file. In general it is preferable to download these files directly rather than opening them in your browser and then cutting and pasting the text into a word-processor such as Microsoft Word or OpenOffice, as these programs often silently introduce small but important changes in the file format.)

The post-split libraries FASTA file is a typical FASTA file, with a few special fields in the label line.

The important things to notice about the format are:

Note

    1. The file is a FASTA file, with sequences in the single line format. That is, sequences are not broken up into multiple lines of a particular length, but instead the entire sequence occupies a single line.
    2. The label line is separated by spaces and has five fields. In order, those fields are: the sample id of the sample that the sequence came from, followed by an underscore and a sequence number (e.g. PC.634_1), the unique sequence id (e.g. FLP3FBN01ELBSX), the original barcode (e.g. orig_bc=ACAGAGTCGGCT), the new barcode after error-correction (e.g. new_bc=ACAGAGTCGGCT), and the number of positions that differ between the original and new barcode (e.g. bc_diffs=0).
    3. Note that the first two fields (the sample id and sequence id) don’t require anything ahead of the ids, whereas the last three (orig_bc, new_bc, and bc_diffs) require the name of the field and an equals sign immediately ahead of the value (e.g. ‘bc_diffs=0’ not ‘bc_diffs = 0’ or just ‘0’).
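The label format described above can be read back with a few string operations. This is a minimal sketch rather than QIIME's own parser; the function name and dictionary keys are hypothetical, and the first field is assumed to be the SampleID followed by an underscore and a sequence number, as in the example file.

```python
def parse_label(label_line):
    """Parse a post-split_libraries FASTA label line into a dict."""
    fields = label_line.lstrip(">").split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, found {len(fields)}")
    # Split on the last underscore so SampleIDs such as PC.634 stay intact.
    sample_id, _, seq_number = fields[0].rpartition("_")
    parsed = {
        "sample_id": sample_id,
        "seq_number": int(seq_number),
        "seq_id": fields[1],
    }
    # The last three fields are name=value pairs with no spaces around '='.
    for field in fields[2:]:
        name, _, value = field.partition("=")
        parsed[name] = value
    parsed["bc_diffs"] = int(parsed["bc_diffs"])
    return parsed


label = ">PC.634_1 FLP3FBN01ELBSX orig_bc=ACAGAGTCGGCT new_bc=ACAGAGTCGGCT bc_diffs=0"
print(parse_label(label))
```

A label written as ‘bc_diffs = 0’ would split into seven fields and be rejected by the length check, which matches the spacing requirement described above.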

Handling Already Demultiplexed Samples

Demultiplexed sequence files are passed to pick_otus.py. Use them to skip the split_libraries.py step when your sequences are already demultiplexed. In order for the downstream modules of QIIME to associate sequences with particular samples, these demultiplexed sequences need to be labeled in such a way that the SampleID (see mapping file format) and sequence number are incorporated into the fasta label.

For instance, if the following fasta sequence:

Note

  • >FLP3FBN01ELBSX length=250 xy=1766_0111 region=1 run=R_2008_12_09_13_51_01_
  • GCAGAGTCGGCTCATGCTGCCTCCCGTAGGAGTCTGGGCCGTGTCTCAGTCCCAATGTGGCCGTTTACCCTCTCAGGCCGGCTACGCATCATCGCCTTGGTGGGC

was the first sequence in the fasta file, and it was associated with the sample PC.634, the demultiplexed sequence should be listed as follows (note that the barcode and primer are removed from the sequence):

Note

  • >PC.634_1 FLP3FBN01ELBSX orig_bc=ACAGAGTCGGCT new_bc=ACAGAGTCGGCT bc_diffs=0
  • CTGGGCCGTGTCTCAGTCCCAATGTGGCCGTTTACCCTCTCAGGCCGGCTACGCATCATCGCCTTGGTGGGC
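The relabeling shown above can be sketched as a small helper. This is an illustration under stated assumptions, not a QIIME script: the function name is hypothetical, the barcode and primer are assumed to sit at the very start of the read and are trimmed by length only, and because no error correction is performed the original and new barcode are written as the same value with bc_diffs=0.

```python
def demultiplexed_record(sample_id, seq_number, seq_id, sequence, barcode, primer):
    """Build the label and trimmed sequence for one already-demultiplexed read."""
    # Assumption: barcode and primer occupy the first bases of the read,
    # so they are removed by length without matching the bases.
    trimmed = sequence[len(barcode) + len(primer):]
    label = (
        f">{sample_id}_{seq_number} {seq_id} "
        f"orig_bc={barcode} new_bc={barcode} bc_diffs=0"
    )
    return label, trimmed


# Short made-up read: 4-base barcode, 2-base primer, then the insert.
label, trimmed = demultiplexed_record(
    "PC.634", 1, "FLP3FBN01ELBSX", "ACGTGGTTTTCC", "ACGT", "GG"
)
print(label)
print(trimmed)
```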

OTU table

OTU tables are sample x observation matrices, and are central to much of the downstream analysis in QIIME. These are generated by pick_de_novo_otus.py but can also be generated outside of QIIME (e.g., exported from MG-RAST for metagenomic analysis with QIIME). These are used in scripts such as beta_diversity_through_plots.py, alpha_rarefaction.py, and summarize_taxa_through_plots.py.

OTU Table overview

The OTU table file format holds information about which OTUs are found in each sample. For a typical QIIME run, it is not necessary to manually construct an OTU table, as this is done automatically from your sequences. However, for some applications it is useful to be able to use the downstream capabilities of the QIIME workflow starting directly from an OTU table. For more information about the OTU table format, which relies on the biom-format, please go here: biom-format

An example OTU file is available here: Example OTU Table

(Right click and use ‘download’ or ‘save as’ to save this file. In general it is preferable to download these files directly rather than opening them in your browser and then cutting and pasting the text into a word-processor such as Microsoft Word or OpenOffice, as these programs often silently introduce small but important changes in the file format.)

ID-to-taxonomy map

ID-to-taxonomy maps are passed to assign_taxonomy.py -m blast via the -t/--id_to_taxonomy_fp option with an associated fasta file passed via -r/--reference_seqs_fp.

Sequence ID to Taxonomy Mapping Files

Several QIIME modules, such as assign_taxonomy.py, require a sequence ID to taxonomy mapping file when one is using a custom training sequence set or BLAST database. ID to taxonomy mapping files are tab delimited, with the sequence ID as the first column, and a semicolon-separated taxonomy, in descending order, as the second column. An example of an ID to taxonomy mapping file is shown below:

Note

  • 339039 Bacteria;Proteobacteria;Alphaproteobacteria;Rhodospirillales;unclassified_Rhodospirillales
  • 199390 Bacteria;Chloroflexi;Anaerolineae;Caldilineae;Caldilineales;Caldilineacea;unclassified_Caldilineacea
  • 370251 Bacteria;Proteobacteria;Gammaproteobacteria;unclassified_Gammaproteobacteria
  • 11544 Bacteria;Actinobacteria;Actinobacteria;Actinobacteridae;Actinomycetales;unclassified_Actinomycetales
  • 460067 Unclassified
  • 256904 Bacteria
  • 286896 Bacteria;Actinobacteria;Actinobacteria;Actinobacteridae;Actinomycetales;Micrococcineae;Micrococcaceae;Kocuria
  • 127471 Bacteria;Bacteroidetes;Sphingobacteria;Sphingobacteriales;Crenotrichaceae;Terrimonas
  • 155634 Archaea;Euryarchaeota;Methanobacteria;Methanobacteriales;Methanobacteriaceae;Methanosphaera

This file can be downloaded here: Example ID to Taxonomy Mapping File (Right click and use ‘download’ or ‘save as’ to save this file)
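Because the format is just two tab-delimited columns, it is simple to read programmatically. The following is a minimal sketch with a hypothetical function name; it assumes one entry per line and returns each lineage as a list ordered from the highest rank down.

```python
def parse_id_to_taxonomy(lines):
    """Map each sequence ID to its list of taxonomic levels."""
    taxonomy = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Ignore blank lines.
        seq_id, lineage = line.split("\t", 1)
        taxonomy[seq_id] = [level.strip() for level in lineage.split(";")]
    return taxonomy


example = [
    "256904\tBacteria\n",
    "127471\tBacteria;Bacteroidetes;Sphingobacteria;Sphingobacteriales;Crenotrichaceae;Terrimonas\n",
]
for seq_id, levels in parse_id_to_taxonomy(example).items():
    print(seq_id, len(levels), levels[-1])
```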

Several Greengenes (http://greengenes.lbl.gov/) sequence ID to taxonomy mapping files are available for download in our Greengenes OTU build. To ensure you have the latest version, follow the link to Most recent Greengenes OTUs on the top right of this page.

To add taxonomy mappings to an existing sequence ID to taxonomy mapping file, open the existing taxonomy mapping file in a spreadsheet, such as Microsoft Excel. Enter new sequence IDs in the first column, and the semicolon-separated taxa in the second column (make sure there are no extra spaces, tabs, or other white space around these entries). Save this modified mapping file with the field delimiter as a tab, and leave the text delimiter blank. It is best to visually inspect the modified ID to taxonomy mapping file in a basic text editor to ensure that no extraneous characters or spacings were saved during this process.

QIIME parameters

The QIIME parameters file is used to pass per-script parameters to the QIIME ‘workflow’ scripts. You can find details on these files in QIIME parameters files.

Sample id map

Some scripts which compare paired samples, including transform_coordinate_matrices.py and compare_taxa_summaries.py, take a parameter, --sample_id_map_fp, which is necessary when comparing data sets with different sample IDs. This file, a sample id map (which is different than a QIIME mapping file), describes how to map from the sample IDs associated with the input data to a new sample id that will be consistent across the data sets being compared. For example, if your first data set contains samples S1, S2, and S3, and these should be paired with samples T1, T2, and T3 in your second data set, your sample id map might look like:

S1      1
S2      2
S3      3
T1      1
T2      2
T3      3

The reason for this format is that it’s usually sample metadata from one or more columns in the QIIME mapping files associated with each data set that allows you to match samples to one another. With this format you can select one or more columns from each QIIME mapping file (concatenating some fields, if necessary) to build the sample id map.

To clarify, this format maps from input sample id to new sample id, not from sample id in matrix 1 to sample id in matrix 2.
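To illustrate how two data sets end up paired through the shared new sample ids, here is a minimal sketch. The function names are hypothetical and this is not how the QIIME scripts are implemented; the two columns are assumed to be separated by whitespace (a tab in practice).

```python
def parse_sample_id_map(lines):
    """Read 'input_id new_id' pairs into a dict."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # Ignore blank lines.
        input_id, new_id = fields
        mapping[input_id] = new_id
    return mapping


def paired_samples(mapping, ids_1, ids_2):
    """Pair sample IDs from two data sets that map to the same new ID."""
    new_to_input_2 = {mapping[sample]: sample for sample in ids_2}
    return [
        (sample, new_to_input_2[mapping[sample]])
        for sample in ids_1
        if mapping[sample] in new_to_input_2
    ]


mapping = parse_sample_id_map(["S1\t1", "S2\t2", "S3\t3", "T1\t1", "T2\t2", "T3\t3"])
print(paired_samples(mapping, ["S1", "S2", "S3"], ["T1", "T2", "T3"]))
```

Note that S1 and T1 are never mapped to each other directly; they are paired only because both map to the new id 1.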

