These are general guidelines that apply to formatting files for use with QIIME, and command line tools in general:
Metadata mapping files are used throughout QIIME, and provide per-sample metadata. These are used in split_libraries.py, beta_diversity_through_plots.py, alpha_rarefaction.py and other scripts.
The mapping file is generated by the user. This file contains all of the information about the samples necessary to perform the data analysis. In general, the mapping file should contain the name of each sample, the barcode sequence used for each sample, the linker/primer sequence used to amplify the sample, and a description column. One should also include in the mapping file any metadata that relates to the samples (for instance, health status or sampling site) and any additional information relating to specific samples that may be useful to have at hand when considering outliers (for example, what medications a patient was taking at time of sampling).
The mapping file relates barcodes in the FASTA file to each sample and their related metadata. Each FASTA file must have at least one mapping file but multiple mapping files can be defined for any given FASTA file. For example, if you have bundled several unrelated studies into one 454 run (for instance, a mouse study, a soil study and a fish study), and need to analyze each study separately, you would generate three separate mapping files that specify a subset of samples and their associated metadata. Alternatively, you can combine multiple runs (e.g. multiple 454 runs, multiple FASTA files) with a single mapping file.
Each column header MUST contain alphanumeric (a-z, A-Z and 0-9) and/or underscore (“_”) characters only, and the header MUST start with a letter. All other characters (e.g. $, *, ^, etc.) are not supported at this time, and use of those characters may cause problems downstream in the QIIME pipeline.
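The header rule above can be expressed as a short regular expression. The following is a minimal sketch, not the check QIIME itself performs: the function name is_valid_header is hypothetical, and the pattern assumes that every digit 0-9 is allowed after the first character. For the first column, the leading pound character would need to be stripped before checking.

```python
import re

# Assumption: a valid header is one letter followed by any number of
# letters, digits, or underscores.
HEADER_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_header(name):
    """Return True if a column header uses only the supported characters."""
    return HEADER_PATTERN.fullmatch(name) is not None


# Hypothetical header names used only for illustration.
for header in ["Treatment", "DOB", "Health_Status", "2ndVisit", "pH$"]:
    print(header, is_valid_header(header))
```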
Currently, users can define their own column headers; however, QIIME will be adopting the MIMARKS standard, so all column headings MUST correspond to the proper MIMARKS nomenclature (http://wiki.gensc.org/index.php?title=MIMARKS). The following details the current mapping file guidelines:
The header for this mapping file starts with a pound (#) character, and generally requires a “SampleID”, “BarcodeSequence”, “LinkerPrimerSequence”, and a “Description”, all tab separated. The following example header represents the minimum field requirement for the mapping file:
#SampleID	BarcodeSequence	LinkerPrimerSequence	Description
Additional optional headers can follow the “LinkerPrimerSequence” header. Any lines following this header that start with the pound character are considered to be comment lines and are ignored by QIIME.
Data fields do not start with a pound character. These fields are tab separated, and have restrictions regarding character usage. SampleID fields only accept alphanumeric and period (.) characters. The other data fields will accept alphanumeric, period (.), underscore (_), percent (%), plus (+), minus (-), space ( ), semicolon (;), colon (:), comma (,), or forward slash (/) characters.
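The character restrictions for data fields can be checked in the same way. This sketch is illustrative only: the function name invalid_fields is hypothetical, the first field of each row is assumed to be the SampleID, and empty values in the other fields are assumed to be acceptable at this stage.

```python
import re

# Character classes taken from the rules stated above.
SAMPLE_ID_PATTERN = re.compile(r"[A-Za-z0-9.]+")
OTHER_FIELD_PATTERN = re.compile(r"[A-Za-z0-9._%+\- ;:,/]*")


def invalid_fields(row):
    """Return (column index, value) pairs in a data row that break the rules.

    `row` is a list of field values; the first one is assumed to be the
    SampleID.
    """
    problems = []
    for index, value in enumerate(row):
        pattern = SAMPLE_ID_PATTERN if index == 0 else OTHER_FIELD_PATTERN
        if pattern.fullmatch(value) is None:
            problems.append((index, value))
    return problems


# Hypothetical row: the underscore in the SampleID and the "@" are flagged.
print(invalid_fields(["PC_354", "AGCACGAGCCTA", "Control@site", "run 1; ok"]))
```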
You are highly encouraged to validate your mapping file using validate_mapping_file.py before attempting to analyze your data. This tool will check for errors, and make suggestions for other aspects of the file to be edited (errors and warnings are output to a log file, and suggested changes to invalid characters are output to a “_corrected.txt” file). The contents of a sample mapping file are shown here. As you can see, a nucleotide barcode sequence is provided for each of the 9 samples, as well as metadata related to treatment group and date of birth, and general run descriptions about the project:
Note
This example mapping file is available here: Example Mapping File (Right click and use ‘download’ or ‘save as’ to save this file)
During demultiplexing with split_libraries.py, the SampleID that is associated with the barcode found in a given sequence is used to label the output sequence. An example set of such assignments can be seen in the Tutorial - Assign Samples to Multiplex Reads section. Note that in this example, the barcode associated with “PC.634” (“ACAGAGTCGGCT”) was found in the first two sequences, so the output “seqs.fna” file has these sequences labeled as “PC.634_1” and “PC.634_2”, respectively. The third sequence contained the barcode “AGCACGAGCCTA” and hence was associated with “PC.354”.
The easiest way to generate a mapping file is to use a spreadsheet program, such as Microsoft Excel. Each header and field should be in its own column. When saving the file, it is best to use the pre-built tab-delimited option. If this is not available for a particular spreadsheet program, set the format to text csv, the field delimiter as a tab, and leave the text delimiter blank. Once the file is saved, open it in a basic text editor to see if the formatting meets the criteria given above. Finally, use validate_mapping_file.py to test the file for QIIME compatibility.
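As an alternative to a spreadsheet program, a mapping file can be written from a script. The sketch below uses Python's standard csv module with a tab as the field delimiter and no text quoting. The function name write_mapping_file is hypothetical, and the primer, Treatment, and Description values are placeholders rather than values from a real study.

```python
import csv
import io

HEADER = ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence",
          "Treatment", "Description"]

# Illustrative rows; the primer and metadata values are placeholders.
ROWS = [
    ["PC.354", "AGCACGAGCCTA", "YATGCTGCCTCCCGTAGGAGT", "Control", "Control_mouse"],
    ["PC.634", "ACAGAGTCGGCT", "YATGCTGCCTCCCGTAGGAGT", "Fast", "Fasting_mouse"],
]


def write_mapping_file(handle, header, rows):
    """Write a tab-delimited mapping file with no text quoting."""
    writer = csv.writer(handle, delimiter="\t", quoting=csv.QUOTE_NONE,
                        lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Write to an in-memory buffer here; pass an open file handle to save to disk.
buffer = io.StringIO()
write_mapping_file(buffer, HEADER, ROWS)
print(buffer.getvalue())
```

Because quoting is disabled and no escape character is set, the writer raises an error if a field itself contains a tab, which is a useful early warning. The resulting file should still be checked with validate_mapping_file.py.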
validate_mapping_file.py will test for many problems in the mapping file, such as incorrect character usage. A “_corrected.txt” form of the mapping file will be generated, in which invalid characters are replaced by allowed characters. The following is an example of an incorrectly formatted mapping file, with invalid characters, duplicated values that should be unique (“SampleID”, “BarcodeSequence”), non-DNA characters in the “LinkerPrimerSequence”, and a missing “Description” cell.
Note
The corrected mapping file will replace invalid characters and fill in missing “Description” fields. The example corrected mapping file output is below:
Note
However, this corrected mapping file is still not usable. The log file generated by validate_mapping_file.py explains the remaining problems. The barcode “AGCACGAGCCTA” is duplicated, and appears in the first two rows. Rows two and three contain the same “SampleID” value. These errors will have to be fixed by hand. In addition, the “Z” character in the fourth row “LinkerPrimerSequence” is not a valid IUPAC DNA character and needs to be replaced with a legitimate nucleotide code.
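The kinds of problems that remain after automatic correction (duplicated values and non-IUPAC characters) are straightforward to look for before running the validator. This is a rough sketch rather than a reimplementation of validate_mapping_file.py: the function names are hypothetical, and the rows are made-up data arranged to mirror the errors described above.

```python
from collections import Counter

# IUPAC nucleotide codes for DNA, including ambiguity codes.
IUPAC_DNA = set("ACGTRYKMSWBDHVN")


def find_duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value in counts if counts[value] > 1]


def non_iupac_characters(sequence):
    """Return the sorted characters in `sequence` that are not IUPAC DNA codes."""
    return sorted(set(sequence.upper()) - IUPAC_DNA)


# Hypothetical rows: (SampleID, BarcodeSequence, LinkerPrimerSequence).
rows = [
    ("PC.354", "AGCACGAGCCTA", "YATGCTGCCTCCCGTAGGAGT"),
    ("PC.355", "AGCACGAGCCTA", "YATGCTGCCTCCCGTAGGAGT"),
    ("PC.355", "AACTCGTCGATG", "YATGCTGCCTCCCGTAGGAGT"),
    ("PC.356", "ACAGACCACTCA", "YATGCTGCCTCCCGTAGGAGZ"),
]

print("Duplicated SampleIDs:", find_duplicates([row[0] for row in rows]))
print("Duplicated barcodes:", find_duplicates([row[1] for row in rows]))
for sample_id, _, primer in rows:
    bad = non_iupac_characters(primer)
    if bad:
        print(sample_id, "has invalid primer characters:", bad)
```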
In some circumstances, users may need to generate a mapping file that does not contain barcodes and/or primers. To generate such a mapping file, fields for “BarcodeSequence” and “LinkerPrimerSequence” can be left empty. An example of such a file is below (note that the tabs are still present for the empty “BarcodeSequence” and “LinkerPrimerSequence” fields):
Note
To validate such a mapping file, the user will need to disable barcode and primer testing with the -p and -b parameters:
validate_mapping_file.py -m <mapping_filepath> -o check_id_output/ -p -b
The above mapping file will still produce a warning: because it lacks barcodes, it provides no way to differentiate sequences and thus cannot be used for demultiplexing. However, such warnings can be ignored if the mapping file is being used for steps downstream of demultiplexing.
When performing a typical workflow, it is not necessary for users to put together the specially formatted post-split-libraries FASTA file. Thus, this section is primarily useful for users who would like to use the downstream capabilities of QIIME without running split_libraries.py. For a description of the essential files for the typical workflow see their description in the QIIME Tutorial.
The purpose of the post-split_libraries FASTA is to relate each sequence to the sample from which it came, while also recording information about the original and error-corrected barcodes from which this inference was made.
Here is an example of the post-split libraries FASTA file format:
Note
An example of the post-split_libraries FASTA file is available here: Example Post Split Libraries Sequence File
(Right click and use ‘download’ or ‘save as’ to save this file. In general it is preferable to download these files directly rather than opening them in your browser and then cutting and pasting the text into a word-processor such as Microsoft Word or OpenOffice, as these programs often silently introduce small but important changes in the file format.)
The post-split libraries FASTA file is a typical FASTA file, with a few special fields in the label line.
The important things to notice about the format are:
Note
Demultiplexed sequence files are passed to pick_otus.py, and are used when you skip the split_libraries.py step because your sequences are already demultiplexed. In order for the downstream modules of QIIME to associate sequences with particular samples, these demultiplexed sequences need to be labeled in such a way that the SampleID (see mapping file format) and sequence number are incorporated into the fasta label.
For instance, if the following fasta sequence:
Note
was the first sequence in the fasta file, and it was associated with the sample PC.634, the demultiplexed sequence should be listed as follows (note that the barcode and primer are removed from the sequence):
Note
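The relabeling described above can be sketched in a few lines. This is an illustration under stated assumptions, not QIIME's own code: the function name demultiplexed_record is hypothetical, the read is assumed to begin with the barcode immediately followed by the primer as literal text (a degenerate primer would need IUPAC-aware matching), and keeping the original sequence ID after the new label is an assumption about the desired layout.

```python
def demultiplexed_record(sample_id, sequence_number, original_id, sequence,
                         barcode, primer):
    """Build a FASTA record labeled SampleID_N with barcode and primer removed.

    Assumes the read starts with the barcode immediately followed by the
    primer, both matched as literal strings.
    """
    prefix = barcode + primer
    if not sequence.startswith(prefix):
        raise ValueError("sequence does not start with barcode + primer")
    trimmed = sequence[len(prefix):]
    label = ">{}_{} {}".format(sample_id, sequence_number, original_id)
    return label + "\n" + trimmed + "\n"


# Hypothetical read ID, primer, and insert sequence.
record = demultiplexed_record(
    sample_id="PC.634",
    sequence_number=1,
    original_id="read_0001",
    sequence="ACAGAGTCGGCT" + "CATGCTGCC" + "TTGGACCGTGTCTCAG",
    barcode="ACAGAGTCGGCT",
    primer="CATGCTGCC",
)
print(record)
```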
The Biological Observation Matrix (or BIOM, canonically pronounced biome) table is the core data type for downstream analyses in QIIME. It is a matrix of counts of observations on a per-sample basis. Most commonly, the observations are OTUs or taxa, and the samples are the unit of sampling in a study (e.g., a microbiome sample from the skin of one individual at one time point). These tables are often referred to as OTU tables in QIIME (but really an OTU table is one type of a BIOM table). BIOM tables are stored in the BIOM file format. For more information about the BIOM file format, please visit http://biom-format.org.
OTU tables are generated during the OTU picking process (e.g., pick_open_reference_otus.py, pick_closed_reference_otus.py, or pick_de_novo_otus.py) but can also be generated externally from QIIME (e.g., exported from MG-RAST for metagenomic analysis with QIIME). They are used in scripts such as core_diversity_analyses.py, beta_diversity_through_plots.py, alpha_rarefaction.py, and summarize_taxa_through_plots.py. For more information about working with BIOM tables in QIIME, please refer to Working with BIOM tables in QIIME.
As of version 1.8.0-dev, QIIME supports BIOM tables stored in version 1.0 and 2.1 of the BIOM file format. The main distinction between these two versions is the underlying file format: JSON is used for version 1.0 and HDF5 is used for version 2.1. Version 2.1 is recommended for large datasets as it provides an efficient way to store and access thousands of samples by millions of observations.
QIIME is designed to work seamlessly with BIOM tables stored in either version, so you shouldn’t need to worry too much about which version your BIOM table is stored in. If the HDF5 libraries and h5py are installed, QIIME will create BIOM tables in version 2.1 of the file format. If h5py and HDF5 are not installed, QIIME will create BIOM tables in version 1.0 of the file format. You can use biom convert to convert between file formats if necessary.
To see if h5py and HDF5 are installed on your system (thus enabling support for version 2.1 of the BIOM file format), run print_qiime_config.py and look for a line of output that is similar to the following (note that version numbers may differ):
h5py version: 2.4.0 (HDF5 version: 1.8.13)
If you instead see the following output, you do not have h5py and/or HDF5 installed, so QIIME will create BIOM tables in version 1.0 of the file format (and support for version 2.1 is disabled):
h5py version: Not installed.
Note that in order to interact with an existing BIOM table stored in version 2.1 of the file format, you will need h5py and HDF5 installed so that QIIME can load the table. This may be necessary, for example, if a collaborator generated a version 2.1 BIOM table and you plan to use QIIME to perform further analyses with the table.
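When a table arrives from a collaborator, it can help to know which storage format it uses before trying to load it. The sketch below guesses the format from the file contents using only the standard library. The function name guess_biom_storage is hypothetical; it assumes an HDF5 file starts with the standard 8-byte HDF5 signature at offset 0, and it treats any file that parses as JSON as a candidate version 1.0 table without checking that it is actually a valid BIOM table.

```python
import json

# The 8-byte signature that normally begins an HDF5 file.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def guess_biom_storage(path):
    """Return "hdf5", "json", or "unknown" for the file at `path`."""
    with open(path, "rb") as handle:
        if handle.read(8) == HDF5_SIGNATURE:
            return "hdf5"
        # Not HDF5: read the whole file and see whether it parses as JSON.
        handle.seek(0)
        try:
            json.loads(handle.read().decode("utf-8"))
        except (UnicodeDecodeError, ValueError):
            return "unknown"
    return "json"
```

A result of "hdf5" suggests that h5py and HDF5 will be needed to work with the table. Parsing the whole file as JSON is slow for very large tables, so this is best treated as a quick diagnostic.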
ID-to-taxonomy maps are passed to assign_taxonomy.py -m blast via the -t/--id_to_taxonomy_fp option, with an associated fasta file passed via -r/--reference_seqs_fp.
Several QIIME modules, such as assign_taxonomy.py, require a sequence ID to taxonomy mapping file when one is using a custom training sequence set or BLAST database. ID to taxonomy mapping files are tab delimited, with the sequence ID as the first column, and a semicolon-separated taxonomy, in descending order, as the second column. An example of an ID to taxonomy mapping file is shown below:
Note
This file can be downloaded here: Example ID to Taxonomy Mapping File (Right click and use ‘download’ or ‘save as’ to save this file)
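The two-column layout is simple to read programmatically, which can be handy for spotting stray whitespace before passing the file to assign_taxonomy.py. The following is a minimal sketch: the function name parse_id_to_taxonomy is hypothetical, and the example entries are invented rather than taken from a real reference set.

```python
def parse_id_to_taxonomy(lines):
    """Map each sequence ID to its list of taxonomic levels, broadest first."""
    taxonomy = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # First column is the sequence ID; the rest is the lineage.
        seq_id, lineage = line.split("\t", 1)
        taxonomy[seq_id.strip()] = [level.strip() for level in lineage.split(";")]
    return taxonomy


# Hypothetical entries used only for illustration.
example = [
    "seq1\tBacteria;Firmicutes;Clostridia\n",
    "seq2\tBacteria;Bacteroidetes\n",
]
for seq_id, levels in parse_id_to_taxonomy(example).items():
    print(seq_id, levels)
```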
Several Greengenes (http://greengenes.lbl.gov/) sequence ID to taxonomy mapping files are available for download in our Greengenes OTU build. To ensure you have the latest version, follow the link to Most recent Greengenes OTUs on the top right of this page.
To add taxonomy mapping to an existing sequence ID to taxonomy mapping file, open the existing taxonomy mapping file in a spreadsheet, such as Microsoft Excel. Save new sequence IDs in the first column, and the semicolon-separated taxa in the second column (make sure there are not extra spaces, tabs, or other white space around these entries). Save this modified mapping file with the field delimiter as a tab, and leave the text delimiter blank. It is best to visually inspect the modified ID to taxonomy mapping file in a basic text editor to ensure that no extraneous characters or spacings were saved during this process.
The QIIME parameters file is used to pass per-script parameters to the QIIME ‘workflow’ scripts. You can find details on these files in QIIME parameters files.
Some scripts which compare paired samples, including transform_coordinate_matrices.py and compare_taxa_summaries.py, take a parameter, --sample_id_map_fp, which is necessary when comparing data sets with different sample IDs. This file, a sample id map (which is different than a QIIME mapping file), describes how to map from the sample IDs associated with the input data to a new sample id that will be consistent across the data sets being compared. For example, if your first data set contains samples S1, S2, and S3, and these should be paired with samples T1, T2, and T3 in your second data set, your sample id map might look like:
S1 1
S2 2
S3 3
T1 1
T2 2
T3 3
The reason for this format is that it’s usually sample metadata from one or more columns in the QIIME mapping files associated with each data set that allows you to match samples to one another. With this format you can select one or more columns from each QIIME mapping file (concatenating some fields, if necessary) to build the sample id map.
To clarify, this format maps from input sample id to new sample id, not from sample id in matrix 1 to sample id in matrix 2.
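To make the direction of the mapping concrete, the sketch below reads the example map and groups the input sample ids by their new sample id, which shows which samples end up paired. The function names are hypothetical, and the parser assumes that each line holds exactly two whitespace-separated fields.

```python
from collections import defaultdict


def parse_sample_id_map(lines):
    """Return {input sample id: new sample id} from two-field lines."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        input_id, new_id = fields
        mapping[input_id] = new_id
    return mapping


def group_by_new_id(mapping):
    """Return {new sample id: [input sample ids that map to it]}."""
    groups = defaultdict(list)
    for input_id, new_id in mapping.items():
        groups[new_id].append(input_id)
    return dict(groups)


lines = ["S1 1", "S2 2", "S3 3", "T1 1", "T2 2", "T3 3"]
pairs = group_by_new_id(parse_sample_id_map(lines))
print(pairs)
```

Each new sample id should collect one input sample id from each data set; a group with a single member indicates a sample that has no partner.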