Detection of Outlying Cells in Contingency Tables Using Model Based Diagnostics

Detecting outliers in contingency tables is an interesting statistical problem, made more difficult by the polarization of cell counts. This study builds on the fundamental definition of an outlier as a 'markedly deviant' cell by introducing a pivot element to capture the deviations. It proposes a two-step confirmatory procedure to detect outliers in I × J contingency tables: (i) identify a reliable set of candidate outliers using the deviation from the pivot element, and then (ii) confirm the outlying cells among them by examining different types of residuals from a suitable fitted model. The robustness of the procedure is investigated through a simulation study, along with applications to real datasets.


Introduction
In recent years, a great deal of attention has been paid to the accommodation and identification of unusual observations (outliers) in data. Outliers may be real errors, or else accurate but unexpected observations that could shed new light on the phenomenon under study (Barnett and Lewis (1994)). Unlike in the metric case, there is no clear definition of outliers for categorical data, as the cells of a contingency table are purely frequencies or counts. Outliers are only vaguely described as cell frequencies that deviate markedly from the expected value or cause a significant lack of fit. Hence, an attempt has been made to make the meaning of 'markedly deviant' precise through a pivot element, by answering which cell deviates, from where, and by how much, based on the generic characteristics of the table.
Many classical statistical methods are extremely sensitive even to slight deviations from the usual distributional assumptions. Until now, research on outliers in I × J contingency tables has been restricted mainly to the study of independence. Graphical displays such as biplots and mosaic plots can also be useful in studying the association between the I rows and J columns and in identifying outlying cells in a contingency table (Friendly (2000); Beh and Lombardo (2014)). Kuhnt (2004) described a procedure to identify outliers based on the tails of the Poisson distribution, declaring a cell an outlier if the actual count falls in the tails of the distribution. Rapallo (2012) studied the pattern of outliers by fitting a log-linear model and testing its goodness of fit, using algebraic statistics to specify the notion of outlier. Kuhnt, Rapallo, and Rehage (2014) detected outliers through subsets of cell counts called minimal patterns for the independence model. Mignone and Rapallo (2018) identified the outlying cells based on a set of proportions in a contingency table. Sripriya and Srinivasan (2018) proposed a new method to detect outliers in two-dimensional tables. The present study offers an alternative approach to detecting outliers based on the assumption of independence.
Residual-based techniques have been widely used to detect outliers in contingency tables (Haberman (1973); Fuchs and Kenett (1980); Bradu and Hawkins (1982); Yick and Lee (1998); Simonoff (1988); Lee and Yick (1999)). Even though residual techniques have been used, no cutoff criterion is provided for choosing the maximum residuals, so the approach remains heuristic in nature.
Further, polarization of cell counts is one of the major problems in outlier detection. Polarization is essentially an uneven distribution of counts in the I × J table. It involves the presence of counts of a disparate nature, such as zero counts, low counts, high counts, and extreme values. For example, a table may contain many zero counts and very few high counts, forming unusual clusters that could affect inference for the I × J table as well as the detection of outliers (Sripriya, Srinivasan, and Gallo (2019)). Thus, the structure and nature of the cell counts, which may range from zero to very high frequencies, play an important role in the analysis of a contingency table (Sangeetha, Subbiah, Srinivasan, and Nandram (2014)). Following Subbiah and Srinivasan (2008) on the sensitivity analysis of 2 × 2 tables, the location of polarized counts in the table poses an additional challenge in the detection of outliers.
In this paper, we propose a two-step confirmatory procedure to detect potential outliers in two-way contingency tables. First, the method identifies a reliable set of candidate outliers in the I × J table through the deviation from the pivot element. Second, model-based diagnostics are computed, and boxplots of the resulting residuals are used to confirm the outlying cells.

Proposed method
Consider N sample observations that are cross-classified in an I × J contingency table, where the cell counts $Y_1, \ldots, Y_{IJ}$ are assumed to be realizations of random variables. Once a contingency table is constructed, the first interest is the hypothesis of either homogeneity or independence, depending on the sampling scheme (Agresti (2002)). When the null hypothesis is rejected, the cell residuals are investigated to identify the cells that deviate greatly from the others. A cell is considered an outlier when its observed frequency deviates markedly from the corresponding expected frequency under the null model. Let $n_{ij}$ be the observed cell frequencies of the I × J table, let $N = \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}$ be the total frequency, and let $T = N/k$, where $k = IJ$, be the pivot element through which the markedly deviant cells are obtained as the candidate set of outliers, denoted by the subset S. For an I × J table, calculate the deviations $D_{ij} = |T - n_{ij}|$ and examine them row by row; if any $D_{ij}$ is markedly deviant from those of the neighbouring cells, that cell is said to be discordant and is included in the subset S. The steps involved in the confirmatory procedure are as follows:
Step 1: Given an I × J table, locate the set of candidate outliers S using $D_{ij} = |T - n_{ij}|$.
Step 2: Fit a Poisson log-linear model to the data with S, since the data are counts.
If the model fits well, go to Step 3; otherwise, go to Step 4.
Step 3: Examine the different types of residuals associated with the model and detect the outliers through boxplots of the residuals.
Step 4: Fit a Negative Binomial model and carry out Step 3.
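Step 1 can be illustrated with a minimal Python sketch. The computation of the pivot T and the deviations D_ij follows the definitions above. The screening rule in `candidate_cells`, including its `factor` argument, is a hypothetical stand-in, since the text describes 'markedly deviant from the neighbouring cells' only qualitatively; the table values are made up for illustration.

```python
from statistics import median


def pivot_deviations(table):
    """Return the pivot T = N / (I * J) and the deviations D_ij = |T - n_ij|."""
    n_rows = len(table)
    n_cols = len(table[0])
    total = sum(sum(row) for row in table)   # N, the total frequency
    pivot = total / (n_rows * n_cols)        # T = N / k with k = I * J
    deviations = [[abs(pivot - n) for n in row] for row in table]
    return pivot, deviations


def candidate_cells(deviations, factor=2.0):
    """Hypothetical screening rule (an assumption, not taken from the paper):
    flag cell (i, j) when D_ij exceeds `factor` times the median of the other
    deviations in the same row. Requires at least two columns."""
    flagged = []
    for i, row in enumerate(deviations):
        for j, d in enumerate(row):
            others = row[:j] + row[j + 1:]
            if d > factor * median(others):
                flagged.append((i, j))
    return flagged


# Made-up 3 x 3 table, used only for illustration.
table = [[10, 12, 11],
         [9, 40, 10],
         [11, 10, 12]]
pivot, dev = pivot_deviations(table)
candidates = candidate_cells(dev)   # zero-based (row, column) indices
```

Any other row-wise comparison could replace the median rule; the sketch only shows where the pivot enters the screening.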
Researchers have commonly applied residual techniques to identify the outlying cells in a table by flagging residuals whose absolute value exceeds 3. In this heuristic approach, outliers are identified irrespective of the polarization of cell frequencies and the order of the contingency table. To overcome this, boxplots of different types of residuals are used to identify the outlying cells. The diagnostic measures considered are (i) the response residual, (ii) the deviance residual, (iii) the Pearson residual, and (iv) the deleted residual.
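The following Python sketch shows how three of these residuals and the boxplot check might be computed. It assumes the independence log-linear model, whose fitted counts have the closed form (row total × column total) / N; this is not necessarily the exact model fitted in the study. The deleted residual is omitted because it requires leverages from a full model fit. All function names are illustrative, and every row and column total is assumed to be positive.

```python
import math
from statistics import quantiles


def independence_fit(table):
    """Fitted counts m_ij = (row total * column total) / N."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    return [[ri * cj / total for cj in col_tot] for ri in row_tot]


def residuals(table, fitted):
    """Response, Pearson and deviance residuals for Poisson counts,
    returned as flat lists in row-major order."""
    out = {"response": [], "pearson": [], "deviance": []}
    for obs_row, fit_row in zip(table, fitted):
        for n, m in zip(obs_row, fit_row):
            out["response"].append(n - m)
            out["pearson"].append((n - m) / math.sqrt(m))
            # n * log(n / m) is taken as 0 when n == 0.
            term = n * math.log(n / m) if n > 0 else 0.0
            dev = 2.0 * (term - (n - m))
            signed = math.copysign(math.sqrt(max(dev, 0.0)), n - m)
            out["deviance"].append(signed)
    return out


def boxplot_outliers(values, whisker=1.5):
    """Indices of values lying beyond the usual boxplot fences."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [k for k, v in enumerate(values) if v < low or v > high]


# Made-up 3 x 3 table, used only for illustration.
table = [[10, 12, 11],
         [9, 40, 10],
         [11, 10, 12]]
fitted = independence_fit(table)
res = residuals(table, fitted)
flagged = {name: boxplot_outliers(vals) for name, vals in res.items()}
```

The 1.5 × IQR fences are the conventional boxplot default; with very few cells the fences can be wide, so small tables may yield no flagged cells.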
Thus, the procedure provides a systematic approach to identifying outliers under polarization for tables of varying order. The following section examines the robustness of the proposed procedure through a simulation study.

Simulation study
A study of over 100 real-life datasets available in the literature has shown that polarization is largely observed in tables of order larger than 2 × 2. The simulation therefore considers tables of order 3 × 3, 4 × 4 and 5 × 5, with N varying from 50 to 350, for the detection of outliers.
The cell frequencies of the tables are assumed to follow $\mathrm{Mult}(N, (p_1, p_2, \ldots, p_k))$, where $p_i \sim U(0, 1)$, $i = 1, 2, \ldots, k$. The behaviour of the different types of residuals when cells are contaminated is observed as part of the diagnostics for outlier detection. Here, contamination is restricted to a single cell at a time, and the number of cells to be contaminated is selected using min{I, J}, where I and J are the numbers of rows and columns, respectively. Different levels of contamination α (10% to 100% of the row total) are considered, and each setting is repeated 500 times.
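One replicate of this design can be sketched in Python as follows. Two details are assumptions rather than statements from the text: the U(0, 1) draws are rescaled to sum to one so that they form valid multinomial probabilities, and contamination is applied by adding α times the row total to the chosen cell. The function names and the seed are illustrative.

```python
import random


def simulate_table(n_rows, n_cols, total, rng):
    """Draw one n_rows x n_cols table from Mult(N, p), with p built from
    U(0, 1) draws."""
    k = n_rows * n_cols
    weights = [rng.random() for _ in range(k)]
    s = sum(weights)
    probs = [w / s for w in weights]   # assumption: rescale to sum to 1
    counts = [0] * k
    # Allocate each of the N observations to a cell with probability p_i.
    for cell in rng.choices(range(k), weights=probs, k=total):
        counts[cell] += 1
    return [counts[i * n_cols:(i + 1) * n_cols] for i in range(n_rows)]


def contaminate(table, i, j, alpha):
    """Return a copy of the table with alpha * (row total) added to
    cell (i, j). Assumption: contamination is additive."""
    out = [row[:] for row in table]
    out[i][j] += round(alpha * sum(table[i]))
    return out


rng = random.Random(2024)            # fixed seed for reproducibility
clean = simulate_table(3, 3, 80, rng)
dirty = contaminate(clean, 0, 0, 0.5)
```

Repeating the two calls 500 times for each level of α and each table order would mirror the replication scheme described above.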
In this simulation study, we examined how consistently the four different residuals identified the contaminated cells correctly.
The six different scenarios described below are carried out through a simulation study, and the results are presented in Tables 1 to 6.
Generate 500 tables of size 3 × 3 with N ranging from 50 to 100. The results reveal that the response and deleted residuals perform well in detecting the outliers under the Poisson model, and the response residuals perform well under the Negative Binomial model. The Pearson residuals yield poor results in detecting outliers in this approach.
Generate 500 tables of size 3 × 3 with N ranging from 100 to 350. The residual analysis shows that the response and deleted residuals identify the outliers to a greater extent under both models, and that all four residuals perform better under the Poisson model than under the Negative Binomial model.
Generate 500 tables of size 4 × 4 with N ranging from 50 to 100. The results reveal that all four residuals perform poorly in detecting the outliers, except the response residuals under the Negative Binomial model.
Generate 500 tables of size 4 × 4 with N ranging from 100 to 350. The results reveal that all four residuals perform poorly in detecting the outliers under both models, owing to the behaviour of the neighbouring cells and, probably, the even distribution of counts in the generated tables.
Generate 500 tables of size 5 × 5 with N ranging from 50 to 100. The results reveal that all four residuals perform poorly in detecting the outliers, except the response residuals under the Poisson log-linear model.
The simulation study has shown that polarization of cell counts is a major issue in the detection of outliers in I × J contingency tables. Indeed, using residuals from a suitable model as the diagnostic measure, together with boxplots, turns out to be a good choice for detecting the outlying cells. The present simulation study is restricted to smaller tables and could be extended to higher-dimensional tables. In addition to the simulation, the study explores certain well-known datasets to corroborate the simulation results.

Data analysis

Student's enrolment data
The data consist of student enrolment figures for the Northern Territory, Australia, collected in seven community schools over eight different periods of the year, and are presented in the following table (Yick and Lee (1998)). The primary interest lies in detecting the outlying cells in the data, if any, before carrying out further analysis.

Artificial data
As a second illustration, we consider an artificial 5 × 5 contingency table with induced outliers from Simonoff (1988), presented in the following table.

Conclusions
Diagnostics for I × J contingency tables have drawn a great deal of attention from statisticians for many years, but the notion of an outlier is not well defined. There is no general agreement among statisticians about the detection of outliers, owing to the polarization of cell frequencies in contingency tables. Such polarized cells in I × J contingency tables have been examined through the independence of attributes. In this direction, a two-phase procedure is devised: a pivot element is identified and the deviations from it are examined, and a confirmatory approach then identifies the outliers through model-based diagnostics.
The procedure finds a reliable set of candidate outliers through the distance measure $D_{ij} = |T - n_{ij}|$ and then applies the confirmatory step: a suitable model is fitted, and the usual diagnostic measures (residuals) are examined through boxplots to identify the outlying cells. The stability of the proposed method in identifying outliers is examined through a simulation study. The results reveal that the response and deleted residuals identify the outliers to a greater extent than the other residuals. Moreover, the results give an indication of the impact of polarization in the table, which is found to be useful in detecting outliers.
Based on the numerical results, we conclude that the two-step confirmatory procedure, which combines a suitable diagnostic measure with a graphical check through boxplots, could be a viable approach to detecting outlying cells in I × J contingency tables. The proposed pivot element detection technique is resistant to masking and swamping effects. The fitting of other generalised linear models to detect outliers in the presence of zero cell frequencies is under investigation.