Compositional uncertainty should not be ignored in high-throughput sequencing data analysis

Gregory Brian Gloor, Jean M. Macklaim, Michael Vu, Andrew D. Fernandes


High-throughput sequencing generates sparse compositional data, yet these datasets are rarely analyzed using a compositional approach. In addition, the variation inherent in these datasets is rarely acknowledged, and ignoring it can lead to many false positive inferences. Using an in vitro selection dataset with an external standard of truth, we demonstrate that examining point estimates of the data can produce false positive results, even with appropriate zero replacement approaches. We demonstrate the variation inherent in real high-throughput sequencing datasets and show that this variation can be approximated, and hence accounted for, by Monte-Carlo sampling from the Dirichlet distribution. Used alone, this approximation is itself problematic, but it becomes useful when coupled with a log-ratio approach commonly used in compositional data analysis. Thus, the approach illustrated here, which merges Bayesian estimation with principles of compositional data analysis, should be generally useful for high-dimensional count compositional data of the type generated by high-throughput sequencing.
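The combination described in the abstract, Monte-Carlo sampling from the Dirichlet distribution followed by a log-ratio transform, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `dirichlet_clr_instances`, the uniform prior of 0.5, the number of instances, the choice of the centred log-ratio as the log-ratio transform, and the example counts are all assumptions made for the sketch.

```python
import numpy as np

def dirichlet_clr_instances(counts, n_instances=128, prior=0.5, seed=0):
    """Draw Dirichlet Monte-Carlo instances for one sample and
    centred-log-ratio (clr) transform each instance.

    `prior` (assumed 0.5 here) is added to every count, so features
    observed zero times still receive a non-zero probability.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    # Dirichlet parameters: observed counts plus the assumed prior.
    alpha = counts + prior
    # Each row is one plausible vector of proportions (sums to 1).
    props = rng.dirichlet(alpha, size=n_instances)
    # clr: log of each part minus the mean log over all parts in the row.
    log_props = np.log(props)
    return log_props - log_props.mean(axis=1, keepdims=True)

# Hypothetical read counts for five features in a single sample.
counts = [0, 3, 120, 4500, 17]
clr = dirichlet_clr_instances(counts)

# Summaries across instances rather than a single point estimate.
clr_mean = clr.mean(axis=0)
clr_sd = clr.std(axis=0, ddof=1)
```

The spread in `clr_sd` is larger for the zero-count and low-count features than for the abundant one, which is the kind of uncertainty a single point estimate with zero replacement would hide.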



