The Impact of COVID-19 on Relative Changes in Aggregated Mobility Using Mobile-phone Data

Mobile-phone data can be used to investigate the mobility of a large part of a population in a given period. Here we analyze this information for Austria in the first half year of the COVID-19 pandemic. The period around the first lockdown is of particular interest, and our focus is on exploring possible differences between age groups and between females and males. The data are treated both from an absolute point of view, by analyzing the numbers as they are reported, and from a relative point of view, with the help of compositional data analysis tools. Our goal is to compare analyses of the absolute values and of relative information, in order to reveal possible differences in the groups formed by age and gender. It turns out that the two types of analyses provide different and partially complementary insights. This is also underlined when analyzing data from call durations, or subsets of the data for specific Austrian districts.

The location of a mobile phone is known to the Mobile Network Operator (MNO) of the Global System for Mobile Communication (GSM) network. We have partnered with an MNO in Austria to access such anonymized data, and we have defined an aggregation method to understand the overall mobility of the whole population. Our data set, the aggregation and anonymization approach, and the various phases of the lockdown in the first half year of the pandemic are outlined in detail in (Heiler, Reisch, Hurt, Forghani, Omani, Hanbury, and Karimipour 2020).
With the outbreak of COVID-19 and the subsequent lockdown in Austria, the mobility behavior of the population has changed significantly. This is reflected in the mobility data derived from mobile-phone information (Heiler et al. 2020). To quantify mobility, an appropriate measure needs to be established. One possibility is the Radius of Gyration (ROG) (Gooch 2011). It is formally defined below, and refers to the time-weighted distance of the movement locations to the primary location. We compute it on a daily level. Its unit is meters and the values are strictly positive. In this work, we analyze the aggregated (median) ROG of the whole population of Austria for various groups as a time series. The groups are defined by gender or by age group.
Traditionally, a comparison is made in terms of absolute information, i.e., the ROG time series values of the different groups are analyzed in their unit of meters. We have conducted such an analysis (Reisch, Heiler, Hurt, Klimek, Hanbury, and Thurner 2021) which focuses on gender differences.
An alternative is to compare relative information, for example the ROG of the males with respect to the females, expressed as the ratio of males to females. This leads to a dimensionless time series, and to a different aspect of data analysis which emphasizes the differences between the individual groups. A joint increase or decrease in both groups may not lead to a big change of the ratio. On the other hand, the ratio will change if the values of one group increase while they decrease in the other group, or vice versa. Here again, the relative change rather than the absolute change is important. For example, if the ROG changes from 1000 m to 2000 m in one group, and from 2000 m to 1000 m in the other group, the ratio changes from 1/2 to 2. The same change would be observed if the absolute values in both groups were larger by a factor of 10. Thus, absolute values are no longer relevant in this consideration, because a multiplication of both values by any positive constant leads to the same ratio. This is still trivial when comparing two groups, but it is no longer straightforward when relative information of several groups, such as age classes, is to be compared. Compositional data analysis is devoted to this problem of analyzing relative information (Aitchison 1986; Pawlowsky-Glahn, Egozcue, and Tolosana-Delgado 2015; Filzmoser, Hron, and Templ 2018). In fact, compositional data analysis is frequently used in geosciences, but also more and more in other fields such as biology (Espinoza, Shah, Singh, Nelson, and Dupont 2020), bioinformatics (Quinn, Erb, Richardson, and Crowley 2018), economics (Trinh, Morais, Thomas-Agnan, and Simioni 2019), marketing (Joueid and Coenders 2018), medicine (Dumuid, Pedišić, Palarea-Albaladejo, Martín-Fernández, Hron, and Olds 2020), and applications with spatially dependent data (Thomas-Agnan, Laurent, Ruiz-Gazen, Nguyen, Chakir, and Lungarska 2021).
We analyze the movement data of the first half year of the COVID-19 pandemic. In this period, the lockdown starting in March 2020 had more substantial effects on mobility than subsequent lockdowns. Our goal is to compare analyses of the absolute values and of relative information, in order to reveal possible differences in the groups formed by age and gender. Potentially, groups at risk could be detected and special interventions put in place to help them ensure their health and safety. This work is structured as follows. In Section 2.1 we give a brief mathematical introduction to compositional data analysis. Section 2.2 provides more details about the mobile-phone data used and about the quantities derived. In addition to the mobility measured as the ROG, we will investigate the call duration per day, again aggregated by the median. Section 3 presents comparisons of the analysis based on absolute and on relative information, and the final Section 4 summarizes the findings.
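The scale invariance of ratios described above can be checked with a few lines of arithmetic. The following is only a minimal sketch using the numbers from the example in the text; the helper name `log_ratio` and the variable names are chosen here for illustration.

```python
import math


def log_ratio(a: float, b: float) -> float:
    """Return ln(a / b) for two strictly positive values."""
    if a <= 0 or b <= 0:
        raise ValueError("values must be strictly positive")
    return math.log(a / b)


# Median ROG in meters for two groups, as in the example in the text.
before = (1000.0, 2000.0)  # (first group, second group)
after = (2000.0, 1000.0)

ratio_before = before[0] / before[1]
ratio_after = after[0] / after[1]

# Multiplying both groups by the same positive constant leaves the ratio unchanged.
scale = 10.0
ratio_before_scaled = (scale * before[0]) / (scale * before[1])
ratio_after_scaled = (scale * after[0]) / (scale * after[1])

# On the log scale the reversal is symmetric around zero.
lr_before = log_ratio(*before)
lr_after = log_ratio(*after)
```

The last two lines illustrate why logarithms of ratios are convenient: swapping the roles of the two groups only flips the sign.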

Compositional data analysis
From the point of view of compositional data analysis, a composition is defined as a multivariate vector consisting of strictly positive values, where the absolute numbers as such are not of interest and only relative information is relevant for the analysis (Filzmoser et al. 2018). The median ROG values of the different age categories for a particular day can be considered as such a composition, with every age category forming one part. We use the notation $x_1, \ldots, x_D$ for the compositional parts of $D$ categories, and the composition is written as the (column) vector $x = (x_1, \ldots, x_D)^\top$. For every day in the data we will observe such a composition, which in fact leads to a multivariate compositional time series. The interest is in relative information in terms of the ratios, and thus all pairs $x_j/x_k$, for $j, k = 1, \ldots, D$, should be considered in the analysis. Obviously, the pairs for $j = k$ are not relevant, and the reverse ratios $x_k/x_j$ do not contain new information. This motivates considering the logarithm of the ratios, $\ln(x_j/x_k)$, the so-called log-ratios. The log-ratio of a reverse ratio differs only in sign, $\ln(x_k/x_j) = -\ln(x_j/x_k)$, so it has the same variance as the original log-ratio and does not need to be considered separately. Moreover, log-ratios tend to be more symmetric than simple ratios without a logarithm (Pawlowsky-Glahn et al. 2015).
Still, the resulting $D(D-1)/2$ pairs $\ln(x_j/x_k)$, for $k > j$, can be represented by at most $D - 1$ components (Filzmoser et al. 2018), and this motivates aggregating this information. Consider the aggregation $y_1 = \frac{1}{D} \sum_{j=1}^{D} \ln(x_1/x_j) = \ln(x_1/g(x))$, where $g(x) = (\prod_{j=1}^{D} x_j)^{1/D}$ is the geometric mean of the composition $x$. Then $y_1$ represents all relative information about the part $x_1$ to the other parts in the composition in the form of an average of the log-ratios. This leads to the definition of the so-called Centered Log Ratio (CLR) coefficients (Aitchison 1986), $y = (y_1, \ldots, y_D)^\top$ with $y_j = \ln(x_j/g(x))$ for $j = 1, \ldots, D$. The vector $y$ contains all relative information about $x$. It consists of $D$ components $y_j$ which are associated with the relative information about the corresponding part $x_j$. However, it turns out that $y_1 + \ldots + y_D = 0$, and thus a representation of data in terms of CLR coefficients leads to singularity (Filzmoser et al. 2018). Although there are ways to circumvent this issue (Filzmoser et al. 2018), we will proceed with CLR coefficients for the following analysis for simplicity.
Consider now a multivariate compositional time series $x_t = (x_{t1}, \ldots, x_{tD})^\top$, for the time points $t = 1, \ldots, T$, with the observations $x_{tj}$ for each part $j \in \{1, \ldots, D\}$. The time series expressed in CLR coefficients is $y_t = (y_{t1}, \ldots, y_{tD})^\top$, with $y_{tj} = \ln(x_{tj}/g(x_t))$ and the geometric mean $g(x_t) = (\prod_{j=1}^{D} x_{tj})^{1/D}$ per time point. Since this data representation only reflects relative information of the time series, an additional visualization of the absolute time series values can be interesting to get a more complete picture.
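As an illustration of the CLR coefficients per time point, the following minimal sketch computes $y_{tj} = \ln(x_{tj}/g(x_t))$ for a small made-up series. The function names `clr` and `clr_series` and the numerical values are hypothetical and not taken from the mobility data.

```python
import math
from typing import List, Sequence


def clr(x: Sequence[float]) -> List[float]:
    """Centered log-ratio coefficients ln(x_j / g(x)) of one composition."""
    if any(v <= 0 for v in x):
        raise ValueError("all parts must be strictly positive")
    logs = [math.log(v) for v in x]
    # ln of the geometric mean equals the arithmetic mean of the logs.
    log_gmean = sum(logs) / len(logs)
    return [lv - log_gmean for lv in logs]


def clr_series(series: Sequence[Sequence[float]]) -> List[List[float]]:
    """Apply the CLR transformation separately to every time point."""
    return [clr(x_t) for x_t in series]


# Hypothetical median ROG values (meters) for D = 5 age groups on T = 3 days.
x = [
    [8000.0, 7000.0, 6000.0, 4000.0, 2000.0],
    [7500.0, 7200.0, 5800.0, 4100.0, 2100.0],
    [2000.0, 2000.0, 2000.0, 1500.0, 1000.0],
]
y = clr_series(x)

# Rescaling a composition by a positive constant does not change its CLR coefficients.
y_rescaled = clr([10.0 * v for v in x[0]])
```

Each row of `y` sums to zero up to floating-point error, which is the singularity mentioned above.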
The CLR coefficients result in multivariate data that can be analyzed with the traditional multivariate statistical methods (Filzmoser et al. 2018). A prominent way to represent the information in a lower-dimensional space is to use Principal Component Analysis (PCA). Since PCA is sensitive to data outliers or inhomogeneous data, robust versions have been proposed, also in the compositional data analysis framework (Filzmoser et al. 2018). The resulting loadings and scores are commonly represented in a biplot to get an overview of the multivariate data (Aitchison and Greenacre 2002).

Mobile-phone data
In this work, we analyze two measures obtained from the mobile phone data, the call duration and the radius of gyration ROG. While the meaning of the former is straightforward, the latter needs to be defined.
Consider an individual $i := i(t)$ on a certain day $t \in \{1, \ldots, T\}$. For data privacy reasons, the individual's index changes every day. The data are made available to the researchers already anonymized with a daily changing key.
Furthermore, the current location of the individual's mobile phone is available at the time points $t_\tau = t + \tau_t$, for a number of $T_t$ time points per day, where $\tau_t \in [0, 1)$. The corresponding x- and y-coordinates are denoted by $(\xi_{it\tau}, \eta_{it\tau})$. With this information, the stay duration $l_{it\tau}$ for individual $i$ at time point $t_\tau$ can be computed, which is used to calculate a weighted center $(\bar{\xi}_{it}, \bar{\eta}_{it}) = \frac{1}{\sum_\tau l_{it\tau}} \sum_\tau l_{it\tau} \, (\xi_{it\tau}, \eta_{it\tau})$ for individual $i$ for day $t$. These coordinates are in the middle of the area covered by all the locations which were visited during the day and are dominated by the two most prominently used (longest used) locations: home and work location.
Further, denote $d_{it\tau}^2 = (\bar{\xi}_{it} - \xi_{it\tau})^2 + (\bar{\eta}_{it} - \eta_{it\tau})^2$ as the squared Euclidean distance between the coordinates $(\bar{\xi}_{it}, \bar{\eta}_{it})$ and $(\xi_{it\tau}, \eta_{it\tau})$. This requires a local coordinate system in order to obtain valid, i.e. less distorted, results. Otherwise, if the coordinates are given in EPSG:4326 (WGS 84), a Haversine distance could be used instead.
The ROG for individual $i$ and day $t$ is then defined as $\mathrm{ROG}_{it} = \sqrt{\sum_\tau l_{it\tau} \, d_{it\tau}^2 \, / \, \sum_\tau l_{it\tau}}$, and it thus represents a distance to the center of all the places of stay during that day $t$, weighted by the lengths of the stay duration at the different places.
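To make the description of the ROG more concrete, the following sketch computes a duration-weighted center and the corresponding radius for one individual and one day. It assumes the common form of the radius of gyration, namely the square root of the duration-weighted mean of the squared Euclidean distances to the weighted center, with coordinates given in a local metric system. The function names and the example coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in meters, local coordinate system assumed


def weighted_center(points: Sequence[Point], durations: Sequence[float]) -> Point:
    """Center of the visited locations, weighted by the stay durations."""
    total = sum(durations)
    if total <= 0:
        raise ValueError("total stay duration must be positive")
    cx = sum(l * p[0] for p, l in zip(points, durations)) / total
    cy = sum(l * p[1] for p, l in zip(points, durations)) / total
    return (cx, cy)


def radius_of_gyration(points: Sequence[Point], durations: Sequence[float]) -> float:
    """Duration-weighted root-mean-square distance to the weighted center.

    Assumption: the weights are normalized by the total stay duration.
    """
    cx, cy = weighted_center(points, durations)
    total = sum(durations)
    weighted_sq = sum(
        l * ((p[0] - cx) ** 2 + (p[1] - cy) ** 2)
        for p, l in zip(points, durations)
    )
    return math.sqrt(weighted_sq / total)


# Two locations 2000 m apart, e.g. home and work, with equal stay durations.
rog_equal = radius_of_gyration([(0.0, 0.0), (0.0, 2000.0)], [12.0, 12.0])

# Spending three quarters of the day at the first location pulls the center towards it.
rog_unequal = radius_of_gyration([(0.0, 0.0), (0.0, 2000.0)], [18.0, 6.0])
```

With equal stay durations the center lies halfway between the two locations, so the radius equals half their distance; unequal durations move the center towards the longer stay and reduce the radius.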
Further details, in particular a description of how the large quantity of data was handled, are available in (Heiler et al. 2020).
Using additional metadata, an individual i can be assigned to a gender group (female, male), to an age group (here we consider the age groups in 15 year intervals: 15-29, 30-44, 45-59, 60-74, and 75+), and to an Austrian district of the daily night location to derive the groups.
Since the distributions are generally very right-skewed, we work with the median per group and day in the following and also ensure k-anonymity for each one.
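The aggregation step can be sketched as follows. This is only an illustration under simplifying assumptions: k-anonymity is approximated here by suppressing every group-day cell with fewer than `k` individuals, and the threshold, the record layout and the function name `group_medians` are hypothetical rather than taken from the actual data pipeline.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Optional, Tuple

Record = Tuple[str, str, float]  # (day, group label, individual ROG in meters)


def group_medians(
    records: Iterable[Record], k: int
) -> Dict[Tuple[str, str], Optional[float]]:
    """Median per (day, group); cells with fewer than k individuals are suppressed."""
    values: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for day, group, rog in records:
        values[(day, group)].append(rog)
    return {
        key: (median(vals) if len(vals) >= k else None)
        for key, vals in values.items()
    }


# Made-up individual values; the large value mimics a right-skewed distribution.
records = [
    ("2020-03-16", "f15", 1200.0),
    ("2020-03-16", "f15", 800.0),
    ("2020-03-16", "f15", 15000.0),
    ("2020-03-16", "f75", 300.0),
]
result = group_medians(records, k=3)
```

The median of the first group is not affected by the single very large value, which is the reason for preferring it over the mean for right-skewed data.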
The resulting time series can be directly investigated in terms of their absolute information, and they can be compared to an analysis based on relative information.

Mobility measured by ROG
The results reported in this section refer to the median values of the ROG per group. To begin with, Figure 1 shows the absolute values for the females (top) and males (bottom) for different age groups. The legend indicates the considered age groups: 15 for age 15-29, 30 for age 30-44, 45 for age 45-59, 60 for age 60-74, and 75 for age 75 and older. For all of the following time series plots, the vertical dashed lines indicate March 16th, 2020, when the restrictions came into effect, and April 6th, 2020, when they were relaxed. The data considered here are from the period February 1st until August 9th, 2020. The plots clearly show the lockdown as an abrupt drop of the median ROG values in all age classes for both genders. After the lockdown, the order of the values remains the same, from the eldest group with the smallest values to the youngest group with the highest values, but on a much lower level. The level then increased more or less systematically until the middle of June. Afterwards, the level does not change much, it is lower than at the beginning, and weekly patterns are visible. Note that these weekly time series patterns, which are very regular at the beginning, become somewhat distorted, partially also due to holidays (April 13th, May 1st, May 21st, June 1st, June 11th), and they never return to this regularity.
In the subsequent analyses we consider the female age groups and the male age groups separately as two compositions. Note that it would also be possible to consider this data set as a multi-factorial composition, with age groups and gender as factors, and to analyze the complete composition jointly. Such an approach has been proposed in (Fačevicová, Filzmoser, and Hron 2022), and the mobility data set has been used as an illustration. Figure 2 focuses on the relative information contained in the median ROG values. The plots show the corresponding CLR coefficients for females (top) and males (bottom). While in Figure 1 we have essentially seen a decline of all values at the beginning of the lockdown phase, followed by an increase, we did not pay attention to how differently the age groups declined and increased. This is the purpose of the relative view in Figure 2, where we mainly investigate the developments of the age groups relative to each other.
In both plots of Figure 2 we can see roughly the same pattern after the lockdown: the biggest relative changes are visible for the youngest and the oldest age group, but they go in different directions. While group 15 had the biggest decline, group 75+ increased its values relative to the other age groups. This seems counter-intuitive, but it can be explained by the fact that the geometric mean also went down significantly, so that the ratio of the values of group 75+ to the geometric mean even increased after the lockdown. Another interesting phenomenon is that the groups 60 and 75+ show the biggest increase in mobility (in a relative sense) during the weekends in this lockdown period. Although on a different level, the values from July show a similar structure to those from February. Interestingly, the youngest age group 15 shows a somewhat mirrored weekly pattern compared to the elder age groups. This is not visible when looking at the absolute values in Figure 1.
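The seemingly counter-intuitive increase of the CLR coefficient of the oldest group, while its absolute values decline, can be reproduced with a small numerical sketch. The values below are made up for illustration and are not taken from the data; the helper `clr` is a hypothetical name for the transformation $\ln(x_j/g(x))$.

```python
import math
from typing import List, Sequence


def clr(x: Sequence[float]) -> List[float]:
    """Centered log-ratio coefficients ln(x_j / g(x)) of one composition."""
    logs = [math.log(v) for v in x]
    log_gmean = sum(logs) / len(logs)
    return [lv - log_gmean for lv in logs]


# Hypothetical median ROG values (meters) for the groups 15, 30, 45, 60, 75+.
before_lockdown = [8000.0, 7000.0, 6000.0, 4000.0, 2000.0]
during_lockdown = [2000.0, 2000.0, 2000.0, 1500.0, 1000.0]

clr_before = clr(before_lockdown)
clr_during = clr(during_lockdown)

# Absolute view: every group declines, including the oldest one.
all_declined = all(d < b for b, d in zip(before_lockdown, during_lockdown))

# Relative view: the oldest group declines less than the geometric mean,
# so its CLR coefficient increases, while that of the youngest group decreases.
oldest_change = clr_during[4] - clr_before[4]
youngest_change = clr_during[0] - clr_before[0]
```

In this made-up example the oldest group halves its value while the geometric mean drops by a larger factor, which is exactly the mechanism described above.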
Relative information could also be understood in terms of data proportions. In particular, one could compute the proportion of a group on the total per time point, which in fact corresponds to normalizing the data per time point to a sum of 1. Such a proportional presentation is shown in Figure 3 for the ROG values of the female age groups. Obviously, the information contained in this representation is different from CLR coefficients, which focus on log-ratio information. One can hardly see any differences between the lockdown period and the remaining period, and thus this kind of "relative view" is not valuable for the analysis.
The median ROG values for the female and male age groups are analyzed in the following with PCA. Here, the method ROBPCA (Hubert, Rousseeuw, and Vanden Branden 2005) is used, a robust version of PCA which downweights outlying observations. Figure 4 shows the biplot of the first two principal components (PCs) for the CLR coefficients. This biplot presentation (as well as all subsequent biplots in this paper) is a so-called form biplot (Aitchison and Greenacre 2002), which favours the representation of the observations in the plot. The coloring is according to the time phases: green before the lockdown, pink during the lockdown period, purple after lockdown until mid-June, and light-blue after this period. The left biplot for the females identifies these four periods as clear clusters, while there is more overlap visible in the right biplot for the males. For the females, the direction of the first PC (71% explained variance) shows a transition of the relative ROG values from the young generation (f15, f30) before lockdown to the old one (f75) during lockdown, and then back to the center. Thus, younger and elderly females show a contrasting behavior in this time period, which was already observed in Figure 2 (top panel).
The second PC (21% explained variance) also shows differences between the time periods, but it additionally reveals weekend effects. Especially on Sundays, the mobility for group f15 was bigger before and after lockdown, but it moved to group f75 during the lockdown phase.
The data structure in the biplot for the males (right plot) looks a bit different, but leads to similar conclusions. PC1 explains 69% and PC2 25% of the variance. Groups m75 and m15 have a similarly diverging behavior of Sunday mobility as observed for the females. The weekdays of the lockdown phase are in the center of the distribution, while for the females they were clearly moved towards group f75. On the other hand, the weekdays in the first time period (February 1st to March 15th) are better distinguishable from the weekdays of the last period (June 15th to August 9th); a possible explanation is that the working male population changed its mobility behavior more substantially than the working female population due to home office.
A quite contrasting view is revealed in Figure 5, which shows the robust PCA results for the absolute values of ROG, for females (left) and males (right). In both analyses, PC1 explains 98% of the variability, and this direction essentially reflects the big change of the ROG over this period. Otherwise, there is not much information left in these analyses, reflecting the limited usefulness of absolute information if the task is to compare age groups.
Figure 5: Biplots of the (absolute) median ROG values for female (left) and male (right) age groups. Green color for the period before the lockdown, pink for the lockdown period, purple after lockdown until mid-June, and light-blue after this period.

Interaction measured by call duration
Figure 6 investigates the median call duration, reported in seconds, again for the two genders and the age groups. The upper plot shows the absolute values jointly for males and females. Here we observe the reverse ordering of the age groups compared to the plots for the ROG values: the lowest values are for the youngest group, and the biggest for the oldest group. The values of the females are systematically higher than those of the males. It is interesting to see that the call durations already started to increase one week before the lockdown. While the ROG time series had their peaks during the weekend, we have the opposite here. This pattern, however, seems to change after the lockdown for group f75 (uppermost line), and it returned to normal only later on.
The bottom plot of Figure 6 presents the CLR coefficients, which are calculated separately for females and males, but presented here jointly for easy comparison. Although the absolute values of the youngest age group also increased with the lockdown, the increase was smaller compared to the other groups, which is reflected by decreasing CLR coefficients. The pattern of f15 and m15 also has an interesting structure: before the lockdown, the groups had quite different behavior within their gender group, but during the lockdown phase they became quite similar. From June on, they again show a behavior similar to that at the beginning. Another interesting phenomenon can be seen after the lockdown: the two oldest groups show a contrary behavior to the other groups during the weekends. Their decline in call duration during the weekends was much smaller than that of the other age groups.
Figure 7 presents biplots of a robust PCA for the CLR coefficients for the female (left) and male (right) age groups. The coloring is the same as in the previous biplots: green before lockdown, pink during, purple after lockdown, and light-blue from June 15th onwards. PC1 explains 72% of the variability for the females and 54% for the males, and PC1 and PC2 together explain about 98% of the variance in both cases. The different groups which are visible in the biplots are essentially weekend effects or effects due to the lockdown. These grouping effects are essentially caused by the youngest and oldest age groups. When comparing the first observed period with the last one, we can find clear differences in the corresponding PCA scores. These differences are essentially caused by the changing contrasting behavior between the youngest and elder groups; groups f75 and m75 (and also m30) do not seem to contribute to this difference. A possible explanation is the exploration of alternative communication methods, especially for the elder groups.

Interactions between source and destination
Figure 7: Biplots of the CLR coefficients of the median call duration values for female (left) and male (right) age groups. Green color for the period before the lockdown, pink for lockdown period, light-blue after this period.
It can be recorded who is actively calling a person, and who is receiving a call. The former person is called source, and the latter destination. Here we investigate the median ROG values for the different age groups of the females and males. However, the data set is more complex now, because a person from a specific age group can be the source, while the destination can originate from a different age group. Moreover, both source and destination will have specific median ROG values. Figure 8 illustrates these data for four specific cases: source f45 (f45 src) with destination f75 (f75 dst), and source f75 (f75 src) with destination f45 (f45 dst). In both cases, the median ROG values can be taken from the source group or from the destination group, see also the figure legend. Throughout the whole period (here from February 1st to July 26th), the median ROG values from the source groups (solid lines) are slightly higher than those of the destination groups (dashed lines) for the same age classes, which can be expected because people from the source groups might call from a place outside their usual environment. While the lines are on a similar level at the beginning and at the end of the considered period, the weekly periodicity changes, probably caused by the summer holidays.
In the following analyses we are interested in the similarity of the relative ROG values in terms of correlations, before lockdown (February 1st to March 15th) and after (March 16th to May 31st). In order to investigate relative information, the CLR coefficients are computed for a composition with all 25 age combinations of the source-destination groups and all 25 age combinations of the destination-source groups, separately for females and males. Figure 9 shows the resulting correlation matrix for the females as a heat map, left for time points before the lockdown, and right after lockdown. The row and column labels refer to the group numbers. For example, src1-3 refers to the time series f15 src - f30 dst, and dst5-1 is the series f75 dst - f15 src.
The heatmaps show that the correlation structure before and after lockdown has clearly changed. Afterwards, there are more blocks with higher (absolute) correlations, and thus more similarity or dissimilarity between certain age groups. In general, there is a more pronounced difference after lockdown in the mobility behavior between the younger and the elder age groups.
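The computation behind such a heat map can be sketched as follows: the composition is transformed to CLR coefficients per time point, and Pearson correlations are then computed between the resulting time series of the parts. The sketch uses a made-up three-part composition instead of the 50 source-destination parts, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def clr(x: Sequence[float]) -> List[float]:
    """Centered log-ratio coefficients ln(x_j / g(x)) of one composition."""
    logs = [math.log(v) for v in x]
    log_gmean = sum(logs) / len(logs)
    return [lv - log_gmean for lv in logs]


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(columns: Sequence[Sequence[float]]) -> List[List[float]]:
    """Matrix of pairwise Pearson correlations between the given series."""
    return [[pearson(ci, cj) for cj in columns] for ci in columns]


# Made-up compositional time series: rows are days, columns are parts.
series = [
    [1000.0, 2000.0, 1500.0],
    [1500.0, 1800.0, 1400.0],
    [900.0, 2500.0, 1600.0],
    [1200.0, 1200.0, 1300.0],
    [1100.0, 2100.0, 1000.0],
]
clr_rows = [clr(row) for row in series]
clr_columns = [list(col) for col in zip(*clr_rows)]  # one series per part
corr = correlation_matrix(clr_columns)
```

Computing one such matrix for the time points before the lockdown and one for those after would give the two panels that are compared in the text.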

Incorporating spatial location
The mobile phone data also contain information about the location, in our case about the Austrian political district in which the phone has been used. The Austrian regions had different restrictions during the lockdown phase, and in particular people from all districts in Tirol had the most substantial movement restrictions. In this section we will provide some analysis examples based on this additional information, rather than going again into detail with gender comparisons. Thus, in Figure 10 we compare the median ROG values for Kitzbühel, a district in Tirol, and Zell am See, which is also a rural district but located in Salzburg. The absolute values of the female age groups are shown in the upper plots, while the CLR coefficients are presented in the lower plots. Since the same scale is used along the vertical axes, one can clearly see the difference in mobility during the lockdown period in Kitzbühel and Zell am See, and this is also visible in the CLR coefficients. For Kitzbühel, there is much smaller variability of the values during lockdown, and also the relative differences between the age groups become much smaller. The change in the relative differences is not so pronounced for Zell am See. This means that also from a relative point of view, the data structure changes completely in Kitzbühel due to the restrictions. Figure 11 focuses on the male age group m30, and compares the composition of all districts in Tirol with that of all districts in Salzburg. The dashed lines refer to the district capitals (Innsbruck and Salzburg, respectively). These districts behave differently compared to the other districts, which are rural with many people commuting to their work place. The values of the districts in Tirol (except Innsbruck) get closer to each other after lockdown, and they start to diverge only in the middle of April.
This may be explained by a similar mobility behavior of the m30 group within this period, which is probably caused by home office or reduced working time. This seems different in the districts of Salzburg, where the CLR coefficients show more variability after lockdown.

Discussion and conclusions
Since the pioneering work of Aitchison (1986) on compositional data analysis, this type of analysis has often been misunderstood as being only applicable to data for which the observations sum up to one, that is, to proportional data. However, the more recent literature made clear that the constant sum constraint is not at all important, because (logarithms of) ratios between the variable values are the building blocks for this analysis, and these are unchanged by rescaling an observation. Even if this aspect is acknowledged, the question is often raised whether a data set is compositional or not, that is, whether the data should be analyzed traditionally or processed with the tools from compositional data analysis. The application in this paper has shown that both types of analyses are appropriate, but they provide answers to different questions, and accordingly the outcomes have a different meaning and interpretation.
Here the focus was on mobile-phone data, which allowed us to analyze the mobility of people. Further, the duration of phone calls has been investigated, including the direction of the call. All this information has been recorded in the first half year of the COVID-19 pandemic, with a special focus on the first lockdown period. After more than two years of "experience" with this pandemic it is clear that this first lockdown had the most substantial effect on people's mobility. Here we investigated whether the effect differs among gender and age groups and for different regions (political areas) in Austria. Specifically, we analyzed how these differences are expressed from the point of view of the absolute data, and also in terms of relative information by making use of the log-ratio analysis.
By analyzing relative changes using compositional data analysis methodologies, formerly hidden insights can be identified. We see that specific age groups of the population restrict mobility less than other population members. This is especially visible for elderly people who have been at high risk of infections, and the younger population especially during weekends. When analyzing the absolute values, this difference between the age groups is hardly visible. We just get the impression that overall, mobility strongly decreased for all age groups with the lockdown, and then slowly increased again. Also for call duration, the absolute numbers provide different insights than relative information, but both are interesting for the analyst.
In all our compositional analyses we have considered the age groups as the parts of the composition, and the data have been analyzed separately for females and males. One could also consider both factors, age and gender, as splitting factors for the composition. However, in this case it is not at all obvious how the composition should be treated, since centered log-ratio coefficients, as an example, would have to be computed differently. An option would be to use different coordinates, such as isometric log-ratio coordinates, where the choice of such coordinates would have to be made with regard to an appropriate interpretability (Pawlowsky-Glahn et al. 2015). Another attempt to link the information of both genders is the use of multi-factorial compositional objects (Fačevicová et al. 2022), which are an extension of compositional tables, see Fačevicová, Hron, Todorov, and Templ (2018).