These are general guidelines that apply to formatting files for use with QIIME, and command line tools in general:
Metadata mapping files are used throughout QIIME and provide per-sample metadata. They are used in split_libraries.py, beta_diversity_through_plots.py, alpha_rarefaction.py and other scripts.
The mapping file is generated by the user. This file contains all of the information about the samples necessary to perform the data analysis. In general, the mapping file should contain the name of each sample, the barcode sequence used for each sample, the linker/primer sequence used to amplify the sample, and a description column. One should also include in the mapping file any metadata that relates to the samples (for instance, health status or sampling site) and any additional information relating to specific samples that may be useful to have at hand when considering outliers (for example, what medications a patient was taking at time of sampling).
The mapping file relates barcodes in the FASTA file to each sample and their related metadata. Each FASTA file must have at least one mapping file but multiple mapping files can be defined for any given FASTA file. For example, if you have bundled several unrelated studies into one 454 run (for instance, a mouse study, a soil study and a fish study), and need to analyze each study separately, you would generate three separate mapping files that specify a subset of samples and their associated metadata. Alternatively, you can combine multiple runs (e.g. multiple 454 runs, multiple FASTA files) with a single mapping file.
Each column header MUST contain only alphanumeric (a-z, A-Z and 0-9) and/or underscore (“_”) characters, and MUST start with a letter. All other characters (e.g. $, *, ^, etc.) are not supported at this time, and using them may cause problems downstream in the QIIME pipeline.
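The header rule above can be checked before running any QIIME script. The following is a minimal sketch, not QIIME's own validation code: the helper name `invalid_headers` is hypothetical, it assumes that all ten digits are acceptable, and it assumes the leading pound character has already been stripped from the first header.

```python
import re

# Hypothetical helper, not part of QIIME. A header must start with a letter
# and may contain only letters, digits, and underscores.
HEADER_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def invalid_headers(headers):
    """Return the headers that break the naming rule, in their original order."""
    return [h for h in headers if HEADER_PATTERN.fullmatch(h) is None]


# "Treatment$" contains an unsupported character; "1stVisit" starts with a digit.
bad = invalid_headers(["SampleID", "BarcodeSequence", "Treatment$", "1stVisit", "DOB"])
print(bad)
```

check_id_map.py remains the authoritative check; a sketch like this is only useful for catching obvious problems early.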
Currently, users can define their own column headers; however, QIIME will be adopting the MIMARKS standard, at which point all column headings MUST correspond to the proper MIMARKS nomenclature (http://gensc.org/gc_wiki/index.php/MIMARKS). The following details the current mapping file guidelines:
The header for this mapping file starts with a pound (#) character, and generally requires a “SampleID”, “BarcodeSequence”, “LinkerPrimerSequence”, and a “Description”, all tab separated. The following example header represents the minimum field requirement for the mapping file:
Note
Additional optional headers can follow the “LinkerPrimerSequence” header. Any lines following this header that start with the pound character are considered to be comment lines and are ignored by QIIME.
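To illustrate how the header line, comment lines, and data lines differ, here is a rough sketch of a reader for this layout. It is not QIIME's parser; the function name `parse_mapping` is hypothetical, and the primer and description values in the example are placeholders.

```python
def parse_mapping(lines):
    """Split mapping-file lines into (header, comments, rows).

    The first non-blank line must start with '#' and is the header.
    Later lines starting with '#' are comments; all others are data rows.
    """
    header = None
    comments = []
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if header is None:
            if not line.startswith("#"):
                raise ValueError("header line must start with '#'")
            header = line[1:].split("\t")
        elif line.startswith("#"):
            comments.append(line[1:])
        else:
            rows.append(line.split("\t"))
    return header, comments, rows


example = [
    "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tDescription\n",
    "#Example comment line\n",
    "PC.634\tACAGAGTCGGCT\tACGTACGT\tplaceholder\n",  # placeholder primer and description
]
header, comments, rows = parse_mapping(example)
print(header, comments, rows)
```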
Data fields do not start with a pound character. These fields are tab separated, and have restrictions regarding character usage. SampleID fields only accept alphanumeric and period (.) characters. The other data fields will accept alphanumeric, period (.), underscore (_), percent (%), plus (+), minus (-), space ( ), semicolon (;), colon (:), comma (,), or forward slash (/) characters.
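The character restrictions above translate directly into two regular expressions. This is a sketch under stated assumptions rather than the check QIIME performs: `check_row` is a hypothetical helper, the first field of each row is assumed to be the SampleID, and empty data fields are allowed.

```python
import re

# SampleID: alphanumeric and period only.
SAMPLE_ID_PATTERN = re.compile(r"[A-Za-z0-9.]+")
# Other fields: alphanumeric, period, underscore, percent, plus, minus,
# space, semicolon, colon, comma, and forward slash (may be empty).
DATA_FIELD_PATTERN = re.compile(r"[A-Za-z0-9._%+\- ;:,/]*")


def check_row(fields):
    """Return (column_index, value) pairs for fields with disallowed characters."""
    problems = []
    for index, value in enumerate(fields):
        pattern = SAMPLE_ID_PATTERN if index == 0 else DATA_FIELD_PATTERN
        if pattern.fullmatch(value) is None:
            problems.append((index, value))
    return problems


good_row = ["PC.354", "AGCACGAGCCTA", "Control", "Control mouse 354"]
bad_row = ["PC_355", "AACTCGTCGATG", "Diet#2"]  # illustrative values only
print(check_row(good_row))
print(check_row(bad_row))
```

The underscore in the SampleID and the pound character in the last field are both reported for the second row.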
You are highly encouraged to validate your mapping file using check_id_map.py before attempting to analyze your data. This tool will check for errors, and make suggestions for other aspects of the file to be edited (errors and warnings are output to a log file, and suggested changes to invalid characters are output to a “_corrected.txt” file). The contents of a sample mapping file are shown here - as you can see, a nucleotide barcode sequence is provided for each of the 9 samples, as well as metadata related to treatment group and date of birth, and general run descriptions about the project:
Note
This example mapping file is available here: Example Mapping File (Right click and use ‘download’ or ‘save as’ to save this file)
During demultiplexing with split_libraries.py, the SampleID that is associated with the barcode found in a given sequence is used to label the output sequence. An example set of such assignments is shown in the Tutorial - Assign Samples to Multiplex Reads section. Note that in this example, the barcode associated with “PC.634”, “ACAGAGTCGGCT”, was found in the first two sequences, and so the output “seqs.fna” file has these sequences labeled as “PC.634_1” and “PC.634_2” respectively. The third sequence contained the barcode “AGCACGAGCCTA”, and hence was associated with “PC.354”.
The easiest way to generate a mapping file is to use a spreadsheet program, such as Microsoft Excel. Each header and field should be in its own column. When saving the file, it is best to use the pre-built tab-delimited option. If this is not available for a particular spreadsheet program, set the format to text csv, the field delimiter as a tab, and leave the text delimiter blank. Once the file is saved, open it in a basic text editor to see if the formatting meets the criteria given above. Finally, use check_id_map.py to test the file for QIIME compatibility.
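If the sample metadata already lives in a script or database, the file can also be written programmatically instead of through a spreadsheet. The sketch below uses Python's standard csv module with a tab as the field delimiter and no text delimiter, mirroring the spreadsheet settings described above. The function name `mapping_text` is hypothetical, and the primer and description values are placeholders.

```python
import csv
import io


def mapping_text(rows):
    """Render rows as tab-delimited text with no quoting (no text delimiter)."""
    handle = io.StringIO()
    writer = csv.writer(
        handle,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # never wrap fields in quote characters
        lineterminator="\n",
    )
    writer.writerows(rows)
    return handle.getvalue()


rows = [
    ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence", "Description"],
    ["PC.634", "ACAGAGTCGGCT", "ACGTACGT", "placeholder description"],
]
text = mapping_text(rows)
print(text)
```

To save the result, write `text` to a file opened in text mode, then run check_id_map.py on that file as described above.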
check_id_map.py will test for many problems in the mapping file, such as incorrect character usage. A “_corrected.txt” form of the mapping file will be generated in which invalid characters are replaced by allowed characters. The following is an example of an incorrectly formatted mapping file, with invalid characters, duplicated values that should be unique (“SampleID”, “BarcodeSequence”), non-DNA characters in the “LinkerPrimerSequence”, and a missing “Description” cell.
Note
The corrected mapping file will replace invalid characters and fill in missing “Description” fields. The example corrected mapping file output is below:
Note
However, this corrected mapping file is still not usable. The log file generated by check_id_map.py explains the remaining problems. The barcode “AGCACGAGCCTA” is duplicated, and appears in the first two rows. Rows two and three contain the same “SampleID” value. These errors will have to be fixed by hand. In addition, the “Z” character in the fourth row “LinkerPrimerSequence” is not a valid IUPAC DNA character and needs to be replaced with a legitimate nucleotide code.
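The two kinds of problem that must be fixed by hand, duplicated values and non-IUPAC characters, are simple to detect. The following sketch shows one way to do so; it is not how check_id_map.py is implemented, and the helper names `duplicated` and `non_iupac_characters` are hypothetical.

```python
from collections import Counter

# The 15 IUPAC nucleotide codes for DNA (four bases plus ambiguity codes).
IUPAC_DNA = set("ACGTRYSWKMBDHVN")


def duplicated(values):
    """Return the values that occur more than once, sorted."""
    counts = Counter(values)
    return sorted(value for value, count in counts.items() if count > 1)


def non_iupac_characters(sequence):
    """Return the characters in a sequence that are not IUPAC DNA codes, sorted."""
    return sorted(set(sequence.upper()) - IUPAC_DNA)


barcodes = ["AGCACGAGCCTA", "AGCACGAGCCTA", "ACAGAGTCGGCT"]
print(duplicated(barcodes))
print(non_iupac_characters("ACGTZ"))  # illustrative primer containing a "Z"
```

The same `duplicated` helper applies to the SampleID column, since both columns must hold unique values.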
In some circumstances, users may need to generate a mapping file that does not contain barcodes and/or primers. To generate such a mapping file, fields for “BarcodeSequence” and “LinkerPrimerSequence” can be left empty. An example of such a file is below (note that the tabs are still present for the empty “BarcodeSequence” and “LinkerPrimerSequence” fields):
Note
To validate such a mapping file, the user will need to disable barcode and primer testing with the -p and -b parameters:
check_id_map.py -m <mapping_filepath> -o check_id_output/ -p -b
When performing a typical workflow, it is not necessary for users to put together the specially formatted post-split-libraries FASTA file. Thus, this section is primarily useful for users who would like to use the downstream capabilities of QIIME without running split_libraries.py. For a description of the essential files for the typical workflow see their description in the QIIME Tutorial.
The purpose of the post-split_libraries FASTA is to relate each sequence to the sample from which it came, while also recording information about the original and error-corrected barcodes from which this inference was made.
Here is an example of the post-split libraries FASTA file format:
Note
An example of the post-split_libraries FASTA file is available here: Example Post Split Libraries Sequence File
(Right click and use ‘download’ or ‘save as’ to save this file. In general it is preferable to download these files directly rather than opening them in your browser and then cutting and pasting the text into a word-processor such as Microsoft Word or OpenOffice, as these programs often silently introduce small but important changes in the file format.)
The post-split libraries FASTA file is a typical FASTA file, with a few special fields in the label line.
The important things to notice about the format are:
Note
Demultiplexed sequence files are passed to pick_otus.py, and used when skipping the split_libraries.py step when your sequences are already demultiplexed. In order for the downstream modules of QIIME to associate sequences with particular samples, these demultiplexed sequences need to be labeled in such a way that the SampleID (see mapping file format) and sequence number are incorporated into the fasta label.
For instance, if the following fasta sequence:
Note
was the first sequence in the fasta file, and it was associated with the sample PC.634, the demultiplexed sequence should be listed as follows (note that the barcode and primer are removed from the sequence):
Note
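The labeling convention described above, a SampleID joined to a sequence number, can be sketched as follows. This is an illustration only: the function names are hypothetical, the sequences are placeholders, and numbering each sequence by its position in the file is an assumption. Only the leading part of the label is covered here, not any additional fields that split_libraries.py writes.

```python
def demultiplexed_label(sample_id, sequence_number):
    """Build the leading part of a demultiplexed FASTA label, e.g. PC.634_1."""
    return "{}_{}".format(sample_id, sequence_number)


def format_demultiplexed_fasta(records):
    """Format (sample_id, sequence) pairs as FASTA text.

    Sequences are assumed to have had barcode and primer removed already.
    Sequence numbers start at 1 and follow the order of the records.
    """
    lines = []
    for number, (sample_id, sequence) in enumerate(records, start=1):
        lines.append(">" + demultiplexed_label(sample_id, number))
        lines.append(sequence)
    return "\n".join(lines) + "\n"


# Placeholder sequences; sample assignments follow the example in the text.
records = [("PC.634", "ACGTACGT"), ("PC.634", "TTGGCCAA"), ("PC.354", "GGCCTTAA")]
fasta_text = format_demultiplexed_fasta(records)
print(fasta_text)
```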
OTU tables are sample x observation matrices, and are central to a lot of downstream analysis in QIIME. These are generated by pick_otus_through_otu_table.py but can also be generated externally from QIIME (e.g., exported from MG-RAST for metagenomic analysis with QIIME). These are used in scripts such as beta_diversity_through_plots.py, alpha_rarefaction.py, and summarize_taxa_through_plots.py.
The OTU table file format holds information about which OTUs are found in each sample. For a typical QIIME run, it is not necessary to manually construct an OTU table, as this is done automatically from your sequences. However, for some applications it is useful to be able to use the downstream capabilities of the QIIME workflow starting directly from an OTU table. For more information about the OTU table format, which relies on the biom-format, please go here: biom-format
An example OTU file is available here: Example OTU Table
(Right click and use ‘download’ or ‘save as’ to save this file. In general it is preferable to download these files directly rather than opening them in your browser and then cutting and pasting the text into a word-processor such as Microsoft Word or OpenOffice, as these programs often silently introduce small but important changes in the file format.)
ID-to-taxonomy maps are passed to assign_taxonomy.py -m blast via the -t/--id_to_taxonomy_fp option with an associated fasta file passed via -r/--reference_seqs_fp.
Several QIIME modules, such as assign_taxonomy.py, require a sequence ID to taxonomy mapping file when one is using a custom training sequence set or BLAST database. ID to taxonomy mapping files are tab delimited, with the sequence ID as the first column, and a semicolon-separated taxonomy, in descending order, as the second column. An example of an ID to taxonomy mapping file is shown below:
Note
This file can be downloaded here: Example ID to Taxonomy Mapping File (Right click and use ‘download’ or ‘save as’ to save this file)
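The two-column layout described above is straightforward to read in a script. The sketch below is not QIIME's parser; `parse_id_to_taxonomy` is a hypothetical helper, and the sequence IDs and lineages in the example are made up for illustration.

```python
def parse_id_to_taxonomy(lines):
    """Map each sequence ID to its list of taxa, from highest to lowest level.

    Each line holds a sequence ID, a tab, and a semicolon-separated taxonomy.
    """
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        seq_id, taxonomy = line.split("\t", 1)
        taxa = [taxon.strip() for taxon in taxonomy.split(";") if taxon.strip()]
        mapping[seq_id.strip()] = taxa
    return mapping


# Made-up IDs and lineages, for illustration only.
example = [
    "seq1\tBacteria;Firmicutes;Clostridia\n",
    "seq2\tBacteria;Proteobacteria\n",
]
taxonomy_map = parse_id_to_taxonomy(example)
print(taxonomy_map)
```

Stripping whitespace around each taxon is a convenience for inspection; the file itself should still be saved without stray spaces.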
Several Greengenes (http://greengenes.lbl.gov/) sequence ID to taxonomy mapping files are available for download in our Greengenes OTU build. To ensure you have the latest version, follow the link to Most recent Greengenes OTUs on the top right of this page.
To add taxonomy mappings to an existing sequence ID to taxonomy mapping file, open the existing file in a spreadsheet program, such as Microsoft Excel. Enter the new sequence IDs in the first column and the semicolon-separated taxa in the second column (make sure there are no extra spaces, tabs, or other white space around these entries). Save the modified mapping file with the field delimiter set to a tab, and leave the text delimiter blank. It is best to visually inspect the modified ID to taxonomy mapping file in a basic text editor to ensure that no extraneous characters or spacing were saved during this process.
The QIIME parameters file is used to pass per-script parameters to the QIIME ‘workflow’ scripts. You can find details on these files in QIIME parameters files.