Robust Statistical Inference for High-Dimensional Data Models with Application to Genomics


  • Pranab Kumar Sen, University of North Carolina, Chapel Hill, U.S.A.



In high-dimension (K), low-sample-size (n) environments, nonlinear, inequality, order, or general shape constraints often crop up in complex ways. As a result, likelihood-based optimal statistical inference procedures may not exist or, at least, may not be available in a manageable form. While some of these inference problems can be treated in asymptotic setups, the curse of dimensionality (i.e., K >> n, often with n small) calls for a different type of asymptotics (in K) with different perspectives. Roy’s union-intersection principle provides some alternative approaches that are generally more amenable to K >> n environments. This scenario is appraised with two important statistical problems in genomic studies, where a large number of (possibly dependent) heterogeneous genes observed on a much smaller sample creates impasses for standard robust inference. These perspectives are examined here in a nonstandard statistical analysis.
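As a rough illustration of the union-intersection idea in a K >> n setting, the sketch below treats the overall null hypothesis as the intersection of K per-gene nulls and rejects it when the largest per-gene statistic is extreme. It is not the procedure developed in the paper: the function names (`ui_statistic`, `ui_permutation_test`), the two-group design, the mean-difference statistic, and the label-permutation calibration are all assumptions chosen for this example. Permuting sample labels is used here only because it keeps the dependence among genes intact.

```python
import random
from statistics import mean


def ui_statistic(data, labels):
    """Union-intersection statistic (illustrative): the largest per-gene
    absolute difference between the two group means."""
    group0 = [row for row, g in zip(data, labels) if g == 0]
    group1 = [row for row, g in zip(data, labels) if g == 1]
    n_genes = len(data[0])
    return max(
        abs(mean(r[k] for r in group1) - mean(r[k] for r in group0))
        for k in range(n_genes)
    )


def ui_permutation_test(data, labels, n_perm=500, seed=0):
    """Calibrate the max statistic by shuffling sample labels (an assumed
    calibration, not taken from the paper). Whole samples are relabelled,
    so the correlation among genes within a sample is preserved."""
    rng = random.Random(seed)
    observed = ui_statistic(data, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if ui_statistic(data, shuffled) >= observed:
            exceed += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return observed, (exceed + 1) / (n_perm + 1)


# Toy data (hypothetical): n = 8 samples, K = 50 genes, shift in gene 0 only.
rng = random.Random(1)
n_samples, n_genes = 8, 50
labels = [0, 0, 0, 0, 1, 1, 1, 1]
data = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]
for row, g in zip(data, labels):
    if g == 1:
        row[0] += 3.0

stat, p_value = ui_permutation_test(data, labels)
print(f"max statistic = {stat:.3f}, permutation p-value = {p_value:.3f}")
```

With only n = 8 samples there are few distinct labellings, so the attainable p-values are coarse; this is one concrete face of the small-n difficulty the abstract describes.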


Chakraborty, R., and Rao, C. R. (1991). Measurement of genetic variation in evolutionary studies. In C. R. Rao and R. Chakraborty (Eds.), Statistical Methods in Biological and Medical Sciences. Handbook of Statistics (Vol. 8, pp. 271-316). Amsterdam: Elsevier.

Coffin, J. M. (1986). Genetic variation in AIDS viruses. Cell, 46(3), 1-4.

Gini, C. W. (1912). Variabilità e mutabilità. Studi Economico-Giuridici della R. Università di Cagliari, 46(2), 3-159.

Hahn, B. H., Shaw, G. M., Taylor, M. E., Redfield, R. R., and Markham, P. (1986). Genetic variation in HTLV-III / LAV over time in patients with AIDS. Science, 232, 1548-1553.

Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics, 19, 293-325.

Pinheiro, A. S., Sen, P. K., and Pinheiro, H. P. (2005). Decomposability of high-dimensional diversity measures: quasi U-statistics, martingales, and nonstandard asymptotics. Submitted for publication.

Pinheiro, H. P., Pinheiro, A. S., and Sen, P. K. (2005). Comparison of genomic sequences using the Hamming distance. Journal of Statistical Planning and Inference, 130, 225-239.

Qiu, X., Brooks, A., Klebanov, L., and Yakovlev, A. (2005). The effects of normalization on the correlation structure of microarray data. BMC Bioinformatics, 6, 120.

Qiu, X., Klebanov, L., and Yakovlev, A. (2005). Correlations between gene expression levels and limitations of the empirical Bayes methodology for finding differentially expressed genes. Statistical Applications in Genetics and Molecular Biology, 4, 34.

Roy, S. N. (1953). On a heuristic method of test construction and its use in multivariate analysis. Annals of Mathematical Statistics, 24, 220-238.

Schaid, D. J., McDonnell, S. K., Hebbring, S. J., Cunningham, J. M., and Thibodeau, S. N. (2005). Nonparametric tests of association of multiple genes with human disease. American Journal of Human Genetics, 76, 789-793.

Sebastiani, P., Gussoni, E., Kohane, I. S., and Ramoni, M. F. (2003). Statistical challenges in functional genomics. Statistical Science, 18, 33-70.

Sen, P. K. (1999). Utility-oriented Simpson-type indexes and inequality measures. Calcutta Statistical Association Bulletin, 49.

Sen, P. K., Tsai, M.-T., and Jou, Y.-S. (2005). High-dimension low sample size perspectives in constrained statistical inference: The SARS-CoV RNA genome in illustration. Submitted for publication.

Silvapulle, M. J., and Sen, P. K. (2004). Constrained Statistical Inference: Inequality, Order, and Shape Restrictions. New York: Wiley.

Simpson, E. H. (1949). The measurement of diversity. Nature, 163, 688.

Tsai, M.-T., and Sen, P. K. (2005). Asymptotically optimal tests for parametric functions against ordered functional alternatives. Journal of Multivariate Analysis, 95, 37-49.

Yoshihara, K. I. (1993). Weakly Dependent Stochastic Sequences and Their Applications: Order Statistics Based on Weakly Dependent Data (Vol. III). Tokyo: Sanseido.




How to Cite

Sen, P. K. (2016). Robust Statistical Inference for High-Dimensional Data Models with Application to Genomics. Austrian Journal of Statistics, 35(2&3), 197–214.