group_significance.py – Compare OTU frequencies across sample groups
Description:
This script is used to compare OTU frequencies in sample groups and to ascertain
whether or not there are statistically significant differences between the OTU
abundance in the different sample groups.
The script will compare each OTU based
on the passed sample groupings to see if it is differentially represented. The
sample groupings are determined by the -c option. The script will group together
samples which have the same value in the mapping file under the header passed
with the -c option. Any samples that do not contain a value under the given
header will not be included in the comparison.
At a basic level, the script is constructing a OTUxSample
(rowXcolumn) contingency table, and testing whether or not each OTU is
differentially represented in cerstain groups of columns (determined by the
metadata category passed).
There are several important considerations with this script.
Errors and behind-the-scenes processing:
- This script will ignore samples that are found in the mapping file but do not
contain information for the passed category. This would cause the mapping
file to fail validate_mapping_file.py, but will not cause a failure here.
- This script will silenty ignore situations where the set of samples in the
mapping file is a superset of the samples in the biom file. If the reverse is
true, the script will error unless –biom_samples_are_superset is passed.
- This script will round P-values greater than 1 (after correcting for multiple
comparisons) to 1.0.
- If your results file contains nans for p values its because one or more of
the assumptions the selected test makes about the data was not met by the
given OTU. The inverse of this statement is not guaranteed; just because the
test worked on the data doesn’t mean all its assumptions are met, just that
enough assumptions are met so it doesn’t fail (see below).
Filtering your OTU table prior to this script is important:
- Filtering out OTUs which are found in a low percentage of samples is a good
idea before using this script. The old otu_category_significance script
removed OTUs that were not found in at least 25 percent of samples. This
prevents 0 variance errors and spurious significance for really low abundance
OTUs and focuses the hypothesis discovery process on the abundant OTUs which
are likely playing a larger role.
Test assumptions:
- This script tests that some basic assumptions of the given statistical test
are met by the passed data. These ‘assumption tests’ are necessary not
sufficient to ensure that the given statistical test you are applying is
appropriate for the data. For instance, the script will error if you use the
Mann-Whitney-U test and one of your group sizes is smaller than 20. It is
likely that assumptions about the distribution of the data, the distribution
of the variance, etc. are not robustly met. IT IS YOUR REPSONSIBILTY TO CHECK
THAT YOU ARE USING AN APPROPRIATE TEST. For more information on assumptions
made by the tests, please view the following resources:
- Biometry by Sokal and Rolhf
- Nonparamteric Statistical Methods by Hollander and Wolfe
- Documentation in R and Scipy packages
- Handbook of Biological Statistics by McDonald (available at
http://udel.edu/~mcdonald/statintro.html)
The assumptions we check for:
- Kruskal-Wallis: No assumptions are checked for Kruskal-Wallis.
- G-test: We check that all the values in the table are non-negative and we
check that each sample grouping contains at least one non 0 value. If either
condition is not met we return G-stat, pval = (nan, nan) but don’t error.
- Mann-Whitney-U: The number of data points in the combined samples is >= 20. If
there are fewer than 21 data points the script will error. Although R gives
exact values for less than 50 data points, its definition of ‘exact’ is
unclear since a conditional permutation calculation with ~ 10^14 calculations
would be required. The normal approximation is suggested for more than 16 data
points by Mann and Whitney 1947, and more than 20 in the Scipy documentation.
The bootstrapped version of the test does not require >20 data points. If all
the data ranks are tied for one of the groups, the function will return U,nan.
- ANOVA: No assumptions are checked for ANOVA. However, if the within group
variance is 0, we return nan,nan.
- T-test: No assumptions are checked for the T-test. If no variance groups are
detected None or nan will be returned.
The assumptions we do not check for are:
- Kruskal-Wallis: The scipy documentation indicates that each group must have
at least 5 samples to make the Chi Squared approximation for the distribution
of the H statistic be appropriate. R has no such requirement, and we do not
implement the requirement here. The KW test does assume that the distributions
from which the samples come are the same (although they may be non-normal)
except for their location parameter, and we do not check this.
- G-test: we check that the data are counts rather than relative abundance.
- Mann-Whitney-U: Equality of variance between groups. Sample 1 is IID, Sample 2
is IID. Sample 1 and Sample 2 are mutually independent.
- ANOVA: ANOVA assumes equality of variance between groups (homoscedasticity),
normality of the residuals, and independence of the individual observations in
the samples. None of these conditions are checked by this script.
- T-test: the t-test assumes that the samples come from populations which are
normally distributed. that the variances of the sampled groups are ~ equal and
that the individual observations are independent. None of these conditions are
checked by this script.
Null and alternate hypothesis:
- G-test: The null hypothesis for the g_test (aka goodness of fit,
log-likelihood ratio test) is that the frequency of any given OTU is equal
across all sample groups. The alternate hypothesis is that the frequency of
the OTU is not the same across all sample groups.
- Kruskal-Wallis: The null hypothesis is that the location paramater of the
groups of abundances for a given OTU is the same. The alternate hypothesis is
that at least one of the location parameters is different.
- ANOVA: the null hypothesis is that the means of the observations in the groups
are the same, the alternate is that at least one is not.
- Mann-Whitney-U: the null hypothesis is that the distributions of the groups
are equal, such that there is a 50 percent chance that a value from group1 is
greater than a value from group2. The alternate is that the distributions are
not the same.
- T-test: the null hypothesis is that the means of the two groups are the same
versus the alternate that they are unequal.
The available tests are:
- ANOVA: one way analysis of variance. This test compares the within-group
variance to the between-group variance in order to assess whether or not the
sample groups have even frequencies of a given OTU. It generalizes the t-test
to more than two groups. This is a parametric test whose assumptions are
likely violated by data found in most gene surveys.
- kruskal_wallis: nonparametric ANOVA. This test is functionally an expansion of
ANOVA to cases where the sample means are unequal and the distribution is not
normal. The assumption that the distribution from which each group (within a
single OTU) came is the same remains. This is a nonparametric test.
- g_test: goodness of fit or log-likelihood ratio test. This test compares the
ratio of the OTU frequencies in the sample groups to an ‘extrinsic hypothesis’
about what their distribution should be. The extrinsic hypothesis coded in this
script is that all sample groups have equal OTU frequencies. The test compares
the ratio of the observed OTU frequencies in the sample groups to the expected
frequencies based on the extrinsic hypothesis. This is a parametric test.
- parametric_t_test: Student’s t-test. This test compares the frequencies of an
OTU in one sample group versus another sample group to see what the probability
of drawing the samples given that each sample had an equal proportion of the OTU
in it. This is a parametric test whose assumptions are likely violated by data
found in most gene surveys.
- nonparametric_t_test: nonparametric t-test is calculated using Monte Carlo
simulation. This test performs in the same way as the parametric t-test, but
computes the probability based on a boot-strap procedure where the sample
group values are permuted. The fraction of the time that a t-statistic
greater than or equal to the observed t-statistic is found is the basis of
the nonparametric p-value. This is a nonparametric test.
- mann_whitney_u: aka Wilcoxon rank sum test is a nonparametric test where the
null hypothesis is that the populations from which the two samples come have
equal means. It is basically an extension of the t-test. This is a nonparametric
test.
- bootstrap_mann_whitney_u: the bootstrapped version of the mann_whitney_u test.
Identical behavior to the nonparametric_t_test. This is a nonparametric test.
Usage: group_significance.py [options]
Input Arguments:
Note
[REQUIRED]
- -i, --otu_table_fp
- Path to biom format table
- -m, --mapping_fp
- Path to category mapping file
- -c, --category
- Name of the category over which to run the analysis
- -o, --output_fp
- Path to the output file
[OPTIONAL]
- -s, --test
- Test to use. Choices are: nonparametric_t_test, bootstrap_mann_whitney_u, ANOVA, kruskal_wallis, g_test, parametric_t_test, mann_whitney_u [default: kruskal_wallis]
- --metadata_key
- Key to extract metadata from biom table. default: taxonomy]
- --permutations
- Number of permutations to use for bootstrapped tests.[default: 1000]
- --biom_samples_are_superset
- If this flag is passed you will be able to use a biom table that contains all the samples listed in the mapping file as well as additional samples not listed in the mapping file. Only their intersecting samples will be used for calculations.
- --print_non_overlap
- If this flag is passed the script will display the samples that do not overlap between the mapping file and the biom file.
Output:
This script generates a tab separated output file with the following headers:
- OTU: OTU id
- Test-Statistic: the value of the test statistic for the given test
- P: the raw P value returned by the given test.
- FDR_P: the P value corrected by the Benjamini-Hochberg FDR procedure for
multiple comparisons.
- Bonferroni_P: the P value corrected by the Bonferroni procedure for multiple
comparisons.
- groupX_mean: there will be as many of these headers as there are unique values
in the mapping file under the category passed with the -c option. Each of these
fields will contain the mean frequency/abundance/count of the given OTU for the
given sample group.
- Taxonomy: this column will be present only if the biom table contained Taxonomy
information. It will contain the taxonomy of the given OTU.
Find which OTUs have the highest probablilty of being differently represented depending on the sample category ‘diet’ using a G test:
group_significance.py -i otu_table.biom -m map_overlapping.txt -c diet -s g_test -o gtest_ocs.txt
Find which OTUs are differentially represented in two sample groups ‘before_after’ using a T-test:
group_significance.py -i otu_table.biom -m map_overlapping.txt -c before_after -s parametric_t_test -o tt_ocs.txt
Find which OTUs are differentially represented in the sample groups formed by ‘diet’ based on nonparamteric ANOVA, aka, Kruskal Wallis test. In addition, prevent the script from erroring because the biom table samples are a superset of the mapping file samples, and print the non-overlapping samples:
group_significance.py -i otu_table.biom -m map.txt -c diet -s kruskal_wallis -o kw_ocs.txt --biom_samples_are_superset --print_non_overlap
Find which OTUs are differentially represented in the sample groups formed by ‘before_after’ based on bootstrapped T-testing with 100 permutations:
group_significance.py -i otu_table.biom -m map_overlapping.txt -c before_after -s nonparametric_t_test --permutations 100 -o btt_ocs.txt