EXTRACTING INFORMATION FROM INTERVAL DATA USING SYMBOLIC PRINCIPAL COMPONENT ANALYSIS

We address the definition of symbolic variance and covariance for random interval-valued variables, and present four known symbolic principal component estimation methods within a common insightful framework. In addition, we provide a simple explicit formula for the scores of the symbolic principal components, equivalent to the Maximum Covering Area Rectangle representation. Furthermore, the analysis of a real dataset leads to a meaningful characterization of Internet traffic applications.


Introduction
The low cost of information storage combined with recent advances in search and retrieval technologies has made available huge amounts of data, the so-called big data explosion. New statistical analysis techniques are now required to deal with the volume and complexity of these data. One promising technique is Symbolic Data Analysis (SDA), introduced by E. Diday (1987).
In conventional data analysis, the variables that characterize an object can only take single values. SDA introduces symbolic random variables, which can take values over complex data structures like lists, intervals, histograms or even distributions (Billard and Diday 2006). Symbolic data may exist in their own right or may result from the aggregation of a base dataset according to the researcher's interest. For example, suppose that our goal is to characterize the ages of university teachers. The variable that records the teachers' age will have as many observations as teachers, and these can differ among universities. Let us assume that a given university has 1000 teachers, and that the values ω_1, ..., ω_1000 are the teachers' ages. SDA calls these values micro-data. In conventional statistical analysis, the universities would have to be characterized by single-valued variables, e.g. the mean teachers' age. SDA can deal with more complex data structures, called macro-data. For example, the teachers' ages can be aggregated into one or multiple intervals. Our main interest in this paper is in interval-valued data, where the macro-data correspond to the interval between the minimum and maximum of the micro-data values: [a, b] = [min{ω_1, ..., ω_1000}, max{ω_1, ..., ω_1000}].
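As a minimal sketch of this aggregation step (in Python, with hypothetical teachers' ages and helper names of our own choosing), micro-data can be turned into interval macro-data, and the interval into the center/range form used below, as follows:

```python
import numpy as np

def to_interval(micro):
    """Aggregate micro-data values into the interval macro-data [min, max]."""
    micro = np.asarray(micro, dtype=float)
    return float(micro.min()), float(micro.max())

def to_center_range(a, b):
    """Equivalent center/range representation of the interval [a, b]."""
    return (a + b) / 2.0, b - a

ages = [31.0, 47.0, 58.0, 39.0, 65.0]   # hypothetical teachers' ages (micro-data)
a, b = to_interval(ages)                 # macro-data interval [31, 65]
c, r = to_center_range(a, b)             # center 48, range 34
```

The center/range form anticipates the representation adopted in the next section.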
The paper is organized as follows. Section 2 presents basic descriptive statistics, including symbolic variances and covariances, for interval-valued data. Section 3 introduces Symbolic Principal Component Analysis (SPCA) for interval-valued data. Section 4 uses SPCA in the analysis of Internet data produced by six different Internet applications. Finally, some conclusions are drawn in Section 5.

Basic descriptive statistics
There have been several proposals for definitions of symbolic versions of sample mean, variance, covariance, and correlation, according to various types of symbolic data, including interval-valued data (Billard and Diday 2006).
We assume that the collected interval-valued data are realizations of random vectors. As such, we consider a random interval-valued vector X = (X_1, ..., X_p)^t, where X_j = [A_j, B_j], with A_j and B_j random variables verifying P(A_j ≤ B_j) = 1, denotes the j-th random interval-valued variable of X. Even though this is the common representation of random interval-valued variables, we follow the approach of Vilela (2015) and write the intervals X_j in terms of their centers, C_j = (A_j + B_j)/2, and their ranges, R_j = B_j − A_j. This choice leads to a clear interpretation of an interval in terms of its "location" on the real line along with its length; moreover, it enables the unification of several results in the literature (cf. Vilela (2015) and references therein). Likewise, the random vector X is equivalently represented by the random vector of centers, C = (C_1, ..., C_p)^t, and the random vector of ranges, R = (R_1, ..., R_p)^t. Let (C_1, ..., C_n)^t and (R_1, ..., R_n)^t denote the vectors of centers and ranges obtained from a random sample of size n from X, where C_i = (C_i1, ..., C_ip)^t and R_i = (R_i1, ..., R_ip)^t characterize the i-th entity or object of the sample. In this setting, a natural proposal for the sample symbolic mean of the interval-valued variable X_j is the traditional sample mean of the centers, X̄_j = C̄_j, with C̄_j = (1/n) Σ_{i=1}^n C_ij (Billard and Diday 2006). As concerns the sample symbolic variance of the interval-valued variable X_j, we express the proposals available in the literature as the sum of two components, the first accounting for the variability of the associated centers and the second for the size of the associated ranges, in the form

S_jj^(α) = (1/n) Σ_{i=1}^n (C_ij − C̄_j)^2 + α (1/n) Σ_{i=1}^n R_ij^2,    (1)

with the nonnegative weight α accounting for the relevance given to the sizes of the ranges.
The first case (α = 0) ignores the contribution of the ranges, simply turning the symbolic variance into the variance of the centers (Billard and Diday 2006). Concerning the second case (α = 1/4), we note that R_ij/2 represents the radius of the j-th random interval-valued variable measured at the i-th entity, so Σ_{i=1}^n R_ij^2/(4n) may be interpreted as the sample second-order moment of that radius. This case was originally proposed by De Carvalho, Brito, and Bock (2006). The third case, presented in Bertrand and Goupil (2000), corresponds to choosing the weight α = 1/12, which is derived assuming that the micro-data are uniformly distributed on the associated macro-data interval.
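A small sketch of the sample symbolic variance in (1), for the three weights discussed above (Python; the function name is our own):

```python
import numpy as np

def symbolic_variance(C, R, alpha):
    """Sample symbolic variance (1): variance of the centers plus
    alpha times the sample second-order moment of the ranges."""
    C, R = np.asarray(C, float), np.asarray(R, float)
    n = C.size
    return np.sum((C - C.mean()) ** 2) / n + alpha * np.sum(R ** 2) / n

centers = np.array([2.0, 4.0, 6.0])     # toy centers of one interval-valued variable
ranges_ = np.array([1.0, 2.0, 3.0])     # and the corresponding ranges
for alpha in (0.0, 1 / 4, 1 / 12):      # the three weights discussed in the text
    print(alpha, symbolic_variance(centers, ranges_, alpha))
```

With α = 0 the result is just the variance of the centers; larger α gives more relevance to the ranges.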
In the same manner, we consider proposals for the sample symbolic covariance between two interval-valued variables X_j and X_l that express it as the sum of two components, the first accounting for the sample covariance of the associated centers and the second for the sizes of the associated ranges, in the form

S_jl^(β) = (1/n) Σ_{i=1}^n (C_ij − C̄_j)(C_il − C̄_l) + β (1/n) Σ_{i=1}^n R_ij R_il,  j ≠ l,    (2)

with the nonnegative weight β accounting for the relevance given to the sizes of the ranges associated with the interval-valued variables X_j and X_l. The case β = 0 was introduced by Billard and Diday (2003), β = 1/12 by Billard (2008), and finally β = 1/4 by Vilela (2015).
In sequence, we may use (1)-(2) to construct a sample symbolic covariance matrix S^(α,β) having on the diagonal the sample symbolic variances S_jj^(α), given in (1), and outside the diagonal the sample symbolic covariances S_jl^(β), j ≠ l, given in (2), leading to

S^(α,β) = S_CC + (1/n) [β R^t R + (α − β) Diag(R^t R)],

with S_CC denoting the sample covariance matrix of the centers, R = [R_ij] the (n × p) matrix of random sample ranges, and Diag(M) the diagonal matrix formed by the diagonal of M. Particular cases of sample symbolic covariance matrices, S^(α,β), with α ∈ {0, 1/12, 1/4} and β = α or β = 0, have been introduced in the literature (vide Vilela (2015) and references therein). Details about the links between these sample symbolic covariance matrices and SPCA for interval-valued data are discussed in the next section.
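Assuming the centers and ranges are held in (n × p) arrays, the sample symbolic covariance matrix combining (1) and (2) can be sketched as follows (Python; the function name and the divisor-n convention are our assumptions):

```python
import numpy as np

def symbolic_cov_matrix(C, R, alpha, beta):
    """Sample symbolic covariance matrix:
    S^(alpha,beta) = S_CC + (1/n) * (beta * R^t R + (alpha - beta) * Diag(R^t R)),
    with S_CC the divisor-n sample covariance matrix of the centers."""
    C, R = np.asarray(C, float), np.asarray(R, float)
    n = C.shape[0]
    S_CC = np.cov(C, rowvar=False, bias=True)   # covariance of the centers
    RtR = R.T @ R
    return S_CC + (beta * RtR + (alpha - beta) * np.diag(np.diag(RtR))) / n
```

For β = α the second term reduces to (α/n) R^t R; for β = 0 only the diagonal is augmented.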

Symbolic principal component analysis
Principal component analysis (PCA) is one of the most popular statistical methods to analyse real data. There have been several proposals to extend this methodology to the symbolic data analysis framework, in particular to interval-valued data. The majority of the available methods rely on a strategy called symbolic-conventional-symbolic, meaning that: (i) the input data are symbolic (here, interval-valued); (ii) the data are converted into conventional data, to which the conventional PCA method is applied; and (iii) at the end, the PCA results are turned back into symbolic ones, usually by a method called Maximum Covering Area Rectangle (MCAR).
We study four SPCA methods: the centers (CPCA) and vertices (VPCA) methods, presented in Cazes, Chouakria, Diday, and Schektman (1997); the Complete Information PCA method (CIPCA), introduced by Wang, Guan, and Wu (2012); and Symbolic Covariance PCA (SymCovPCA), proposed by Le-Rademacher and Billard (2012). CPCA and VPCA correspond to the first SPCA methods proposed in the literature, and the last two are among the most recent alternatives. All these methods rely on the symbolic-conventional-symbolic strategy, which can be specified as follows: (i) compute the associated (p × p) sample symbolic covariance matrix S^(α,β) (vide Table 1); (ii) obtain the spectral decomposition of S^(α,β), as in the conventional PCA; and (iii) transform the conventional scores into symbolic scores, e.g. using MCAR.
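The three steps above can be sketched end-to-end as follows (a hypothetical Python helper, not the authors' implementation; the MCAR scores are expressed through the centers and ranges of each object):

```python
import numpy as np

def spca_mcar(C, R, alpha, beta):
    """Symbolic-conventional-symbolic SPCA sketch:
    (i) sample symbolic covariance matrix S^(alpha,beta),
    (ii) its spectral decomposition, and
    (iii) MCAR interval scores built from centers and ranges."""
    C, R = np.asarray(C, float), np.asarray(R, float)
    n = C.shape[0]
    # (i) S^(alpha,beta) from the centers and ranges
    S = np.cov(C, rowvar=False, bias=True) + (beta / n) * R.T @ R
    S[np.diag_indices_from(S)] += (alpha - beta) * np.sum(R ** 2, axis=0) / n
    # (ii) eigenvalues/eigenvectors, sorted by decreasing eigenvalue
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    # (iii) MCAR scores: centers and ranges of the score intervals
    score_cen = (C - C.mean(axis=0)) @ vecs     # centers of the intervals
    score_rng = R @ np.abs(vecs)                # ranges of the intervals
    return vals, vecs, score_cen - score_rng / 2, score_cen + score_rng / 2
```

Choosing (α, β) as in Table 1 selects one of the four estimation methods.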
Note that S^(1/4,0) and S^(1/12,0) (vide Table 1) are covariance matrices that use a definition of symbolic variance of an interval-valued variable that does not coincide with the definition of symbolic covariance between that interval-valued variable and itself. On the one hand, this violates a basic rule of the conventional framework, namely that the variance of a variable equals the covariance of the variable with itself. On the other hand, S^(1/4,0) and S^(1/12,0) can be seen as the sum of two symmetric positive semi-definite matrices, which guarantees that they are themselves symmetric positive semi-definite and, therefore, that their eigenvalues and eigenvectors verify the usual properties associated with conventional principal components. Interestingly, Wang et al. (2012), who proposed S^(1/12,0), argue that defining the variance of a variable differently from the covariance of the variable with itself is in fact an advantage of their method.
Table 1: Sample symbolic covariance matrices, S (α,β) , defined by the combination of several proposals for symbolic variances and covariances along with the corresponding SPCA method.
Similarly to the conventional PCA, it may be interesting to define the SPCA based on standardized interval-valued variables. To do so, we introduce the sample symbolic correlation matrix

P^(α,β) = (U^(α))^{-1} S^(α,β) (U^(α))^{-1},

where U^(α) is the diagonal matrix with entries (S_jj^(α))^{1/2}, j = 1, ..., p, so that P^(α,β) has unit diagonal and off-diagonal entries S_jl^(β)/(S_jj^(α) S_ll^(α))^{1/2}, for j ≠ l. Equivalently, S^(α,β) = U^(α) P^(α,β) U^(α). Thus, SPCA methods based on standardized interval-valued variables just have to use P^(α,β) instead of S^(α,β).
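Given any sample symbolic covariance matrix S^(α,β), the standardization to P^(α,β) is a one-liner (Python sketch; the function name is our own):

```python
import numpy as np

def symbolic_corr(S):
    """P^(alpha,beta) = U^-1 S U^-1, with U = diag(sqrt(S_jj)):
    unit diagonal, symbolic correlations off the diagonal."""
    u = np.sqrt(np.diag(S))
    return S / np.outer(u, u)
```

The factorization S = U P U is recovered by rescaling with U = diag(sqrt(S_jj)).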
The most common way to transform conventional objects into symbolic ones, for methods following the symbolic-conventional-symbolic strategy, is the MCAR representation. This representation was introduced by Chouakria (1998) to obtain the symbolic scores of the CPCA and VPCA methods, but it can be used with any other method following the symbolic-conventional-symbolic strategy. According to this proposal, the lower bound of the sample interval-valued score of the i-th object on the j-th symbolic principal component (SPC) is formed by the linear combination of the lower bounds of the original intervals for the positive weights, γ_kj > 0, plus the combination of the upper bounds for the negative weights, γ_kj < 0, and conversely for the upper bound, leading to

SPC_ij = [ Σ_{k: γ_kj > 0} γ_kj (A_ik − C̄_k) + Σ_{k: γ_kj < 0} γ_kj (B_ik − C̄_k) ,
           Σ_{k: γ_kj > 0} γ_kj (B_ik − C̄_k) + Σ_{k: γ_kj < 0} γ_kj (A_ik − C̄_k) ],

where C̄_k = (1/n) Σ_{i=1}^n C_ik, j = 1, ..., p, i = 1, ..., n, and γ_j = (γ_1j, ..., γ_pj)^t is the j-th eigenvector of S^(α,β), the sample symbolic covariance matrix under consideration. Equivalently, SPC_ij has center Σ_{k=1}^p γ_kj (C_ik − C̄_k) and range Σ_{k=1}^p |γ_kj| R_ik. Moreover, the hyper-rectangle formed by the first k SPC, (SPC_i1, ..., SPC_ik)^t, is the MCAR k-dimensional representation of the i-th object.
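The equivalence between the bound-based MCAR definition and the explicit center/range form of the scores can be checked numerically (Python sketch on randomly generated intervals; the loading vector is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3
A = rng.random((n, p))
B = A + rng.random((n, p))            # interval bounds with A <= B
C, R = (A + B) / 2, B - A             # centers and ranges
gamma = rng.standard_normal(p)
gamma /= np.linalg.norm(gamma)        # a hypothetical unit loading vector
Cbar = C.mean(axis=0)

# MCAR bounds: lower interval bounds go with positive weights, upper with negative
pos, neg = np.clip(gamma, 0, None), np.clip(gamma, None, 0)
low = (A - Cbar) @ pos + (B - Cbar) @ neg
up = (B - Cbar) @ pos + (A - Cbar) @ neg

# explicit center/range form of the same interval score
cen = (C - Cbar) @ gamma
rad = (R @ np.abs(gamma)) / 2
assert np.allclose(low, cen - rad) and np.allclose(up, cen + rad)
```

The algebra behind the assertion: for γ_k > 0, γ_k C_ik − |γ_k| R_ik/2 = γ_k A_ik, and for γ_k < 0 it equals γ_k B_ik.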

Analysis of Internet data
In this section we illustrate the use of SPCA through a dataset of Internet traffic, typically observed in backbone networks, measured during July 2014. Specifically, the dataset contains traffic produced by six different Internet applications, namely Web browsing (produced by HTTP), file sharing (produced by Torrent), streaming, video (YouTube), port scans (produced by NMAP), and snapshots. The first four applications correspond to regular traffic and the last two to Internet attacks. The analysis usually aims at detecting the various Internet applications within a traffic aggregate and/or separating regular from illicit traffic.
The dataset comprises 917 traffic objects, corresponding to packet flows of specific applications, which we call datastreams. For each datastream, we registered five different traffic characteristics observed in 0.1-second intervals during 5 minutes. The traffic characteristics registered were the following: number of upstream packets (PUp), number of downstream packets (PDw), number of upstream bytes (BUp), number of downstream bytes (BDw), and number of active TCP sessions (Ses). Thus, each object is characterized by a total of 3000 observations per traffic characteristic, which constitute our micro-data.
The conventional approach to analyse these data is based on summary statistics of each traffic characteristic. In particular, Pascoal, Oliveira, Valadas, Filzmoser, Salvador, and Pacheco (2012) and Pascoal (2014) used 8 summary statistics (minimum, 1st quartile, median, mean, 3rd quartile, maximum, standard deviation, and median absolute deviation) for the above five traffic characteristics, giving a total of 40 variables to describe the datastreams. This approach usually requires a pre-processing step to remove irrelevant and redundant variables; Pascoal (2014) used a robust feature selection method based on mutual information for that purpose.
This dataset is naturally symbolic, since each traffic characteristic is multi-valued. SDA takes into consideration the complex structure of these data, and may lead to clearer interpretations and new insights. In our case, we will use interval-valued variables for each traffic characteristic (our macro-data), instead of the 8 summary statistics listed above.
Given the nature of the data and the existence of potential atypical observations among the micro-data, we decided to trim 1% of the lowest and 1% of the highest values. This was only done for the regular applications, given that the illicit ones have few datastreams and small variability, and would be completely eliminated from the dataset even for such small trimming percentages. Apart from that, and following the recommendations in Pascoal et al. (2012) and Pascoal (2014), the data were smoothed using a logarithm transformation (ln(x + 1), to overcome the existence of zeros). SPCA, estimated according to the four methods under study, was applied to this dataset, and the percentages of explained variance from the conventional and symbolic approaches are summarized in Table 2. The conventional analysis of Table 2 suggests retaining 2 principal components, explaining between 80.3% (CIPCA) and 95.6% (SymCovPCA) of the total sample variance associated with S^(α,β), meaning that, e.g. for CIPCA, (λ_1 + λ_2)/Σ_{j=1}^5 λ_j = 0.803. The results obtained with CPCA and SymCovPCA are similar, and so are the results obtained with VPCA and CIPCA. Moreover, these similarities are easily explained by the expressions of Table 1.
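The preprocessing just described (trimming 1% per tail for the regular applications, then ln(x + 1) smoothing before aggregating into an interval) can be sketched as follows (Python; the helper name and interface are our assumptions):

```python
import numpy as np

def preprocess_to_interval(micro, trim=0.01, smooth=True):
    """Trim a fraction of the lowest and highest micro-data values,
    apply ln(x + 1) smoothing, and aggregate into an interval."""
    x = np.sort(np.asarray(micro, dtype=float))
    k = int(np.floor(trim * x.size))
    if k > 0:
        x = x[k:x.size - k]       # drop k values from each tail
    if smooth:
        x = np.log1p(x)           # ln(x + 1) handles zero counts
    return float(x.min()), float(x.max())
```

For the illicit applications, `trim=0.0` would be used, matching the text.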
Table 3 shows the loadings of the first and second SPC, obtained with the four methods. In the case of SymCovPCA, the numbers of upstream and downstream bytes (BUp, BDw) have the highest loadings (in absolute value) in the definition of the first SPC. Thus, the center and range of the first SPC can be interpreted as a weighted sum of the numbers of upstream and downstream bytes. The number of bytes is sometimes referred to as the traffic volume.
For the center, the negative coefficients indicate that datastreams with a high (low) number of bytes in both directions have low (high) center values on the first SPC. For the range, the coefficients are taken in absolute value, so datastreams with a high (low) number of bytes in both directions have high (low) range values on the first SPC. Recall that the range expresses the inner variability of the micro-data. As for the second SPC, the loading associated with the number of sessions stands out. Thus, datastreams characterized by a high (low) number of sessions have high (low) center and range values on the second SPC.
The SymCovPCA scores are shown in Figure 1(a). Each datastream is represented by a rectangle, defined by the centers and ranges of the first two SPC. It can be said that the various Internet applications are, in general, well identified, since the datastreams show similar patterns for the same application. Most datastreams have a small minimum traffic volume (number of bytes), with the corresponding rectangles leaning to the right side. HTTP shows no distinctive characteristic, since its datastreams spread over all score ranges. This can be explained by the heterogeneity of user behaviours and accessed Web pages, typical of Web browsing. Torrent is concentrated on the upper part of the graph, due to its high number of sessions. Its high number of sessions and large variability of the traffic volume are mostly explained by the variation in the number of available peers during traffic sharing sessions. The graph also suggests the existence of several Torrent groups, but this pattern becomes clearer with the CIPCA method. The behaviour of video on the second SPC contrasts with that of Torrent: it is concentrated in the lower part of the graph, due to its low number of sessions. Moreover, video is the application with the highest traffic volume. We may say that video datastreams are characterized by a low number of high-volume sessions, and Torrent by a high number of high-volume sessions. Streaming has a behaviour similar to video, but with a higher number of sessions and lower traffic volume. NMAP is the application with the smallest volume and variability, and has also a relatively low number of sessions. Finally, the behaviour of snapshot is in-between video and streaming, both in terms of volume and number of sessions. Snapshot has two clear groups, which differ on the peak traffic volume and correspond to full desktop and partial desktop uploads, respectively.
Table 3 shows that the loadings obtained with CIPCA are much higher (in absolute value) for BDw (first component) and BUp (second component). Thus, the first SPC can be interpreted as the number of bytes down (BDw) and the second one as the number of bytes up (BUp). The CIPCA scores are shown in Figure 1(b). Snapshot has the highest upstream peak traffic volume, and is now better separated from video and streaming. NMAP is again the application with the smallest rectangles. However, it is now better separated from HTTP, since most HTTP datastreams have higher traffic volume ranges simultaneously in the upstream and downstream directions. Video and streaming are also well separated, since video datastreams have consistently higher traffic volume ranges simultaneously in both directions. Regarding Torrent, it is now possible to distinguish three groups whose centers occur at approximately the same upstream traffic volume: one group has small traffic ranges in both directions (small rectangles) and high downstream volume, another has high traffic ranges in the downstream direction but small ones in the upstream direction, and a third one has small downstream volumes but high upstream traffic ranges. These groups emerge from differences in the relative location of peers and the quality/stability of links. The first group corresponds to closer peers from which it is possible to download at higher speeds, the third to farther peers for which the links are less stable and unable to sustain high download speeds, and the second group is a mixture of the two previous ones. To validate these interpretations, we follow an approach similar to Cadima and Jolliffe (2001), in which we consider the truncated SPC with loadings, γ_j^tr, equal to γ_j for the input variables considered relevant for interpreting γ_j (highlighted in boldface in Table 3) and zero otherwise. With these truncated weights, we compute the corresponding centers and ranges of the truncated SPC. Finally, we calculate
the correlations between the corresponding truncated and complete (i.e., non-truncated) ranges and centers. If these correlations are close to 1, then the selection of variables performed to give a meaning to the j-th SPC by considering its truncated version is validated. In our case, for CPCA and SymCovPCA we made the following selection: γ_1^tr = (0, γ_21, 0, γ_41, 0)^t and γ_2^tr = (0, 0, 0, 0, γ_52)^t, and for VPCA and CIPCA we chose γ_1^tr = (0, γ_21, 0, 0, 0)^t and γ_2^tr = (0, 0, 0, γ_42, 0)^t. In sequence, the first two truncated SPC of the centers and ranges were calculated, as well as their sample correlations with the corresponding non-truncated SPC of centers and ranges, with the resulting values summarized in Table 4. Note that all these correlations are close to 1, with the exception of the correlation between the truncated and complete versions of the SPC_2^SymCovPCA ranges. We note that the inclusion of an additional input variable in the interpretation of the SPC_2^SymCovPCA ranges has the merit of increasing the correlation between the truncated and complete versions from 0.774 to 0.902. Nevertheless, it has the drawback of making the interpretation of the SPC_2^SymCovPCA ranges more complex and difficult.
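This validation step can be sketched as follows (Python; the helper name and the boolean-mask interface for selecting relevant loadings are our assumptions):

```python
import numpy as np

def truncated_validation(C, R, gamma, keep):
    """Correlations between complete and truncated SPC centers and ranges;
    'keep' is a boolean mask marking the loadings deemed relevant."""
    gamma = np.asarray(gamma, float)
    g_tr = np.where(keep, gamma, 0.0)           # truncated loading vector
    Cc = C - C.mean(axis=0)
    cen_full, cen_tr = Cc @ gamma, Cc @ g_tr    # SPC centers
    rng_full, rng_tr = R @ np.abs(gamma), R @ np.abs(g_tr)  # SPC ranges
    return (np.corrcoef(cen_full, cen_tr)[0, 1],
            np.corrcoef(rng_full, rng_tr)[0, 1])
```

Correlations close to 1 support interpreting the SPC through the selected variables only.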

Conclusion
Starting from a generic definition of symbolic variance and covariance for random interval-valued variables, we have used a common insightful framework to present four symbolic principal component estimation methods that rely on a symbolic-conventional-symbolic strategy: CPCA, VPCA, CIPCA, and SymCovPCA. This framework highlighted similarities and differences between these methods.
Aiming at improving the interpretation of symbolic principal components, we proposed the use of truncated versions of symbolic principal components, which are obtained from the (complete) symbolic principal components by relying only on a strict subset of the original symbolic variables.
The analysis of a symbolic dataset containing Internet traffic led to a clear interpretation of the underlying Internet applications (Web browsing, file sharing, streaming, video, port scans, and snapshots). The analysis pointed out the difficulties in separating illicit traffic from regular traffic, suggesting the need to develop outlier detection methods for symbolic data. Furthermore, the analysis highlighted similarities between the symbolic principal component estimation methods considered in the paper.

Table 2 :
Eigenvalues of the sample symbolic covariance matrices for each estimation method, and associated cumulative percentage of total conventional variance.

Table 3 :
Eigenvectors of the sample symbolic covariance matrices for each estimation method, called loadings.

Table 4 :
Sample correlation between non-truncated, SPC_j^M, and truncated, [SPC_j^M]^tr, symbolic principal component centers, C[•], and ranges, R[•], according to the selection made for each estimation method, M.