Visualization of Record Swapping

Record swapping is a statistical disclosure control technique widely used to secure the confidentiality of microdata obtained in surveys and censuses. It is one of the methods recommended by the Centre of Excellence on Statistical Disclosure Control of the European Union for the protection of data from Census 2021. This method is based on the swapping of high-risk records between geographical areas. The analysis of perturbed microdata is usually done purely on numerical results, and only simplified schemes are used for general illustration. We wanted a comprehensive visualization, and we have therefore prepared a visualization of record swapping based on Choropleth maps and Commuting Flow maps, also known as Origin-Destination maps. Choropleth maps use differing colours within predefined geographic areas to represent aggregated statistical data. Origin-Destination maps are widely used to visualize and describe people commuting to work, but they can also be used to describe other indicators. We utilize these methods to visualize the swapping of records for individual statistical units (persons). This approach is demonstrated using microdata from the Population Census 2011 from the Czech Republic and synthetic Austrian EU-SILC data. The proposed visualization allows statistical offices and agencies to effectively evaluate the swapping process and the distribution of the records across geographical areas.


Introduction
Record swapping is a method of statistical disclosure control that focuses on protecting data released by national statistical offices and agencies (NSOs). The basic principle of this method consists of swapping high-risk households across regions based on their risk of disclosure. Record swapping is one of the two methods recommended by the Centre of Excellence on Statistical Disclosure Control of the European Union (Antal, Enderle, and Giessing 2017) for the protection of the Population Census 2021.
This method was initially proposed by Dalenius and Reiss (1982) and has since been examined by researchers across the world in a significant number of articles, for example Longhurst, Tromans, Tudor, and Miller (2008), Fienberg and McIntyre (2004), Shlomo, Tudor, and Groom (2010), Soria-Comas and Domingo-Ferrer (2012), Muralidhar (2017), and Lukan and Smukavec (2017). A review of these papers and studies shows that the method has never been properly visualized; only simplified illustrations are presented. The analysis of perturbed microdata is usually performed purely with numerical results, and the principle of the method has generally been illustrated only with a basic schematic picture. For example, the following illustrations can be found.
Figure 1 shows an illustrative example by the Census Division of the National Records of Scotland, NRS (2013), from the PAMS Conference. Figure 2 shows an illustration of targeted record swapping from Dove, Ntoumos, and Spicer (2018), where it was used to describe the method used by the United Kingdom Office for National Statistics in 2011. Figure 3 shows an illustrative example by the Northern Ireland Statistics and Research Agency, NISRA (2021), from their Census Statistical Disclosure Control Methodology.
We contribute to this body of research by proposing a method that enables proper visualization. In contrast to the articles listed above, where the method is described only in the text and the results are evaluated only numerically, we were able to create a visualization through which statistical offices and agencies can observe the swapping between regions. This approach allows statistical offices and agencies to visually evaluate the swapping process and the distribution of the records across geographical areas.
Our paper introduces a novel approach to the visualization of record swapping. We use well-established Choropleth maps and Commuting Flow maps, also known as Origin-Destination maps. These provide an intuitive tool for understanding commuting flows, for example, commuting to work. The underlying idea of using these maps is that swapping people between regions follows the same principle as if these people were commuting to those regions. If we swap a certain number of people in order to protect data, we can visualize it with maps designed to depict people's migration.
This paper is structured in three sections. The first section outlines the process of record swapping. The second describes the basics of Choropleth maps and Commuting Flow maps. Finally, the third section illustrates the proposed visualization method using an example derived from the 2011 Population Census data from the Czech Republic and synthetic Austrian EU-SILC data.

Record swapping
The method of record swapping was first introduced approximately forty years ago by Dalenius and Reiss (1982). For further description and analysis, we have considered the adjusted method recommended by the Centre of Excellence on Statistical Disclosure Control of the European Union (Antal et al. 2017) for the protection of the Population Census 2021, which involves targeted record swapping. The idea behind this method is quite simple. Firstly, all records that carry a risk of disclosure are identified. Subsequently, these records are swapped with similar records from other geographical areas.
For the analysis, we use the R programming language (R Core Team 2023) with the package sdcMicro (Templ et al. 2023).
As Shlomo et al. (2010) described, this method is targeted at high-risk households, where the swap conditions are determined by the geographical hierarchies, and the records that are most at risk are defined by unique cells on the margins of key variables (Young, Martin, and Skinner 2009). The number of swapped records is based on two factors. The first is the proportion of unique cells defined by the combination of key variables, estimated in our analysis by the k-anonymity rule. The second is the swap rate, a manually set value determined by the NSO. The swap rate is defined as a p% sample of the households which will always be swapped. High-risk households are derived from the identified high-risk individuals. These high-risk individuals are determined on the basis of frequency counts of univariate distributions on a set of key variables at different geographical hierarchies.
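The identification step described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the sdcMicro implementation: the toy data, the column names, the threshold `k`, the reading "count below k means high risk", and the rounding of the swap rate with `ceiling()` are all hypothetical choices made for the example.

```r
# Toy person-level data; every value here is invented for illustration.
persons <- data.frame(
  hid    = c(1, 1, 2, 2, 3, 4, 4, 5),
  region = c("A", "A", "A", "A", "A", "B", "B", "B"),
  age    = c("30-39", "0-9", "30-39", "30-39", "60-69", "30-39", "30-39", "60-69"),
  sex    = c("F", "M", "F", "F", "M", "F", "F", "M"),
  stringsAsFactors = FALSE
)

k <- 2            # hypothetical k-anonymity threshold
swap_rate <- 0.1  # hypothetical swap rate (share of households)

# Number of persons in the same region sharing each key-variable combination.
persons$count <- ave(rep(1, nrow(persons)),
                     persons$region, persons$age, persons$sex,
                     FUN = sum)

# Assumption: a person is high-risk when fewer than k persons share the combination.
persons$high_risk <- persons$count < k

# A household is high-risk as soon as one of its members is.
risky_households <- sort(unique(persons$hid[persons$high_risk]))

# Assumption: the swap rate is turned into a minimum household count by rounding up.
min_swaps <- ceiling(swap_rate * length(unique(persons$hid)))

print(risky_households)
print(min_swaps)
```

In this toy example three households contain a person whose combination of age group and sex is unique within the region, so they would be swapped even though the swap rate alone asks for only one household.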
In the R package sdcMicro (Templ et al. 2023), the risk associated with an individual is calculated using counts over the geographic hierarchies and the combination of all risk variables. The k-anonymity rule is used for the estimation of the risk $r_{i,h}$ for each individual $i$ in the geographic hierarchy $h$ and can be defined as

$$r_{i,h} = \frac{1}{\sum_{j=1}^{N_{g_1}} 1\left[v_{1(i)} = v_{1(j)} \wedge \dots \wedge v_{p(i)} = v_{p(j)}\right]},$$

where $v_1, \dots, v_p$ is the set of risk variables, $N_{g_1}$ is the number of persons living in the region $g_1$, and $1[\dots]$ is the indicator function. Put more simply, the risk is the reciprocal of the number of persons in the region who share the combination of risk-variable values of individual $i$:

$$r_{i,h} = \frac{1}{c_{i,h}},$$

where $c_{i,h}$ denotes that count.
The sampling probability $p_{y,h}$ for the household $y$ at the geographic hierarchy $h$ is calculated from the risks $r_{i,h}$ of all individuals living in the same household and is defined by the maximum risk across all household members as

$$p_{y,h} = \max_{i \in y} r_{i,h}.$$

On each of the defined hierarchical levels, household records are checked against the k-anonymity rule. If a record has a unique combination of risk variables and therefore does not satisfy the k-anonymity rule, the household is swapped with a so-called donor household. Donor households are drawn using the sampling probability defined above and are always taken from a different geographic region. This process is repeated across all geographical levels, from the highest hierarchy level to the lowest.
The swap rate defines the minimum number of households to be swapped, from which two scenarios can arise. In the first, no records violating the k-anonymity rule are left to swap; households are then swapped further until the number set by the swap rate is reached. In the second, more households have already been swapped than the swap rate requires, but some records still do not fulfil the k-anonymity rule; these records are swapped as well in order to be protected (Templ et al. 2023). For more information on targeted record swapping in censuses, see Antal et al. (2017) and Shlomo et al. (2010).
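The risk and sampling probability described in this section can be illustrated with a small base R sketch. It assumes that all persons live in a single region, that the individual risk is the reciprocal of the number of persons sharing the same risk-variable combination, and that the household value is the maximum over its members; the data and column names are invented for the example and are not taken from sdcMicro.

```r
# Toy data for one region; all values are invented for illustration.
persons <- data.frame(
  hid = c(1, 1, 2, 3, 3),
  age = c("30-39", "0-9", "30-39", "30-39", "60-69"),
  sex = c("F", "M", "F", "F", "M"),
  stringsAsFactors = FALSE
)

# Count of persons sharing each combination of the risk variables.
counts <- ave(rep(1, nrow(persons)), persons$age, persons$sex, FUN = sum)

# Individual risk: reciprocal of the count (a unique combination gives risk 1).
persons$risk <- 1 / counts

# Household sampling probability: maximum risk among household members.
p_household <- tapply(persons$risk, persons$hid, max)

print(persons)
print(p_household)
```

Households 1 and 3 each contain a person with a unique combination, so both receive the value 1, while household 2 shares its combination with two other persons and receives 1/3.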

Visualization
Given the multitude of data visualization methods, we consider Choropleth maps and Origin-Destination maps the most apt techniques for effectively displaying the swapping process.
Choropleth maps (Chen, Härdle, and Unwin 2008) use differing shades or colours within predefined geographic areas to represent aggregated statistical data. We use classed choropleth maps, in which class intervals convert continuous estimates into an ordered variable with a few values that can be represented using a few colours.
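The classing step can be sketched in base R as follows. The region labels, swap counts, break points and colours are hypothetical values chosen only to show how a continuous count becomes an ordered class that maps to a fill colour; a mapping package would normally perform this step internally.

```r
# Hypothetical swap counts per region (invented values).
swaps <- c(R1 = 0, R2 = 420, R3 = 310, R4 = 150, R5 = 35)

# Hypothetical class intervals: [0, 50), [50, 200), [200, 400), [400, Inf).
breaks <- c(0, 50, 200, 400, Inf)
labels <- c("0-49", "50-199", "200-399", "400+")

# cut() turns the continuous counts into an ordered factor with four classes.
classes <- cut(swaps, breaks = breaks, labels = labels,
               right = FALSE, ordered_result = TRUE)

# One colour per class, from light (few swaps) to dark (many swaps).
palette <- c("#fee5d9", "#fcae91", "#fb6a4a", "#cb181d")
fill <- palette[as.integer(classes)]

print(data.frame(region = names(swaps), swaps = swaps,
                 class = classes, fill = fill, row.names = NULL))
```

Fixed breaks keep maps of different model runs comparable; quantile-based breaks would be an alternative when the aim is to spread regions evenly across classes.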
Origin-Destination statistics are a popular output of the population census from national statistical offices. Commuting to work or school is one of the most analyzed components of population mobility. Articles that deal with this area include, for example, Martin, Gale, Cockings, and Harfoot (2018) and Šanda (2020).
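National statistical offices prepare intelligible maps for the users of commuting statistics, also known as Origin and Destination data or simply flow data. The data (ONS 2011) represent the flow of people from one place to another and are mainly focused on Migration Statistics, Workplace Statistics, Residence Statistics and Student Statistics. Well-made interactive maps of commuting flows were prepared in Great Britain thanks to the openly accessible dataset of flow data from the population census 2011, as well as by the statistical office in Austria in their STATatlas.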
We have followed the procedure described by Lovelace, Nowosad, and Münchow (2019). Origin-Destination data can be represented by areal units, desire lines, routes, nodes or route networks. In our opinion, desire lines, which are straight lines representing records of people's travels, offer the best visualization. These lines connect points or zone areas in geographic space. A desire line is created by connecting the centroid of the origin zone with the centroid of the destination zone. The name comes from transport planning, as the line should represent where people desire to go between zones. The line connecting the origin and the destination zone is drawn on the map as a straight line because it should represent the quickest route between points A and B without any obstacles. Therefore, this type of visualization is best for our purpose of mapping the swapping of high-risk households between regions.
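The construction of desire lines can be sketched without any spatial package. The sketch assumes that zone centroids are already available as planar coordinates; the zone names, coordinates and swap counts are invented, and a spatial package would normally derive the centroids from the zone polygons.

```r
# Hypothetical zone centroids in planar coordinates (invented values).
centroids <- data.frame(
  region = c("A", "B", "C"),
  x = c(0, 4, 0),
  y = c(0, 3, 5),
  stringsAsFactors = FALSE
)

# Hypothetical flows between zones.
od <- data.frame(
  origin      = c("A", "A", "B"),
  destination = c("B", "C", "C"),
  swaps       = c(12, 5, 8),
  stringsAsFactors = FALSE
)

# Look up the centroid of the origin and of the destination for every flow.
o <- centroids[match(od$origin, centroids$region), ]
d <- centroids[match(od$destination, centroids$region), ]

# A desire line is the straight segment from (x0, y0) to (x1, y1).
desire <- data.frame(od, x0 = o$x, y0 = o$y, x1 = d$x, y1 = d$y)
desire$length <- sqrt((desire$x1 - desire$x0)^2 + (desire$y1 - desire$y0)^2)

print(desire)
```

Each row can then be drawn as one line segment, with its colour or width taken from the `swaps` column.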
The data structure needed for the visualization is quite simple. The data set is constructed with only three columns, as illustrated in Table 1. The first column, "Origin", represents the swapped household's region of origin; the second column, "Destination", represents the swapped household's destination region; and the third column gives the number of swaps from the region of origin to the destination region.
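A table of this shape can be derived from the region of each household before and after swapping. The following base R sketch assumes a hypothetical household table with the columns `region_before` and `region_after`; the values are invented and the column names are not those of any particular package output.

```r
# Hypothetical household regions before and after swapping (invented values).
households <- data.frame(
  hid           = 1:7,
  region_before = c("A", "A", "A", "B", "B", "C", "C"),
  region_after  = c("B", "B", "C", "A", "A", "A", "C"),
  stringsAsFactors = FALSE
)

# Keep only households whose region changed, i.e. those that were swapped.
moved <- households[households$region_before != households$region_after, ]

# Count the households for every origin-destination pair.
od <- aggregate(hid ~ region_before + region_after, data = moved, FUN = length)
names(od) <- c("Origin", "Destination", "Swaps")
od <- od[order(od$Origin, od$Destination), ]
rownames(od) <- NULL

print(od)
```

Because every swap exchanges two households, the flow from A to B equals the flow from B to A in this toy example.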
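For the visualization, we have used the R programming language by R Core Team (2023) with the packages tmap by Tennekes (2018), stplanr by Lovelace and Ellison (2019), tidyverse by Wickham et al. (2019), RCzechia by Lacko (2021), and giscoR by Hernangómez, EuroGeographics, and Arel-Bundock (2023). The package tidyverse was used for tidying the data for the other functions. The package RCzechia provided geopolygons of the Czech Republic's geographical hierarchy. The package giscoR provided geopolygons of Austria's geographical hierarchy. The package stplanr provided the main functions for the Origin-Destination analysis. The package tmap was used for the final visualization of the maps.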

Example of the visualization
We present the proposed visualization first on the data from the Population Census 2011 from the Czech Republic. The data used to demonstrate the proposed visualization were not chosen at random: the idea of visualizing this method came to us while we were analysing its impact on the census data. At the end of this part, we also present the visualization of another dataset, for which we have chosen synthetic Austrian EU-SILC data from the R package laeken by Alfons and Templ (2014).
For the visualization, we propose to use Choropleth maps and Commuting Flow maps, also known as Origin-Destination maps. The idea behind using Origin-Destination maps is that swapping people between regions follows the same principle as if these people commuted to work in that region. If we exchange a certain number of people in order to protect data, we can visualize the swap with maps designed to show the migration of people.
Origin-Destination maps can be used to identify the most significant movements of households produced by the record swapping method. At the same time, they serve as an elegant visual check that the method behaves according to its parameterization, for example, under the condition that a household does not leave the borders of its region. Choropleth maps then show clearly which regions are most affected by the method.
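The visual check described above has a simple numerical counterpart, sketched here in base R. The district codes, the district-to-region lookup and the flows are hypothetical; the point is only that a flow whose origin and destination belong to different regions violates a stay-within-region condition.

```r
# Hypothetical flows between districts (invented values).
od <- data.frame(
  Origin      = c("d1", "d2", "d3", "d1"),
  Destination = c("d2", "d1", "d4", "d3"),
  Swaps       = c(10, 10, 4, 1),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from district to its parent region.
parent <- c(d1 = "R1", d2 = "R1", d3 = "R2", d4 = "R2")

# A flow crosses a regional border when the parent regions differ.
od$crosses_region <- unname(parent[od$Origin] != parent[od$Destination])

# Under the stay-within-region condition this table should be empty.
violations <- od[od$crosses_region, ]

print(violations)
```

On a map, such a violation would appear as a desire line that crosses a regional border, which is exactly the pattern the Origin-Destination map makes visible.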
The following part consists of 12 maps that show the swapping of households in the Czech Republic. We ran the record swapping function of the sdcMicro package in the R language. The model parameters were set to a k-anonymity threshold of 3 and a swap rate of 10%. The geographical hierarchy parameter of the function was set to regions (NUTS 3), districts (LAU 1) and municipalities (LAU 2). The risk variables from which the k-anonymity rule was calculated were age category, sex, nationality, and religion. To show different situations, we ran the model twice, the only difference between the two runs being the similarity profile parameter. In both models, the similarity profile included the variable defining household size, so that swapping occurred only between households of the same size. Firstly, we ran the model with this parameterization only, which resulted in households being swapped across the whole country. Secondly, we added the region (NUTS 3) to the similarity profile, which stipulates that households have to stay within the borders of their respective NUTS 3 regions. Below, we refer to these two models as the models without and with the parameter of the region.
The visualization can be plotted on many different geographical hierarchy levels. We present the differences in the geographical levels of regions (NUTS 3), districts (LAU 1) and municipalities (LAU 2). For the visualization, we have used the R programming language (R Core Team 2023). Origin-Destination maps, which we here call the maps of swaps between regions, show the connections between the centroids of each region, and the frequencies of swaps are colour-coded. To emphasize the swaps with the highest frequency, we have highlighted them on the maps.
The data from the Population Census 2011 from the Czech Republic are used for the first part of the visualization. It is necessary to emphasize that the resulting maps do not show the final settings of the method for the protection of census data in the Czech Republic. The maps only illustrate a hypothetical use of the method.
Figure 4 shows a map of swaps between NUTS 3 regions without applying the condition that the household must stay within the borders of the NUTS 3 region. We can see that most households are swapped between the Stredocesky region and the Jihomoravsky region, and between the Stredocesky region and the Vysocina region. The most swapped households are from the Stredocesky region. The lowest swapping rate can be observed in the Karlovarsky region.
In contrast, Figure 6 shows a map of swaps between NUTS 3 regions with the condition. If we apply the condition that households have to remain within the borders of their NUTS 3 region and plot the swapping lines at the NUTS 3 level, we correctly observe an empty map, because otherwise the condition would be violated. More information can be seen in Figure 7, which shows the choropleth map of the counts of swaps for NUTS 3 regions with the condition applied. The Prague region is missing from the map because Prague forms a region of its own; therefore, no swapping took place there. The most swapped households were again in the Stredocesky region and the fewest in the Karlovarsky region. Compared with Figure 5, the Moravskoslezsky region had more swaps and the Plzensky region had fewer.
If we move to the lower level of LAU 1 districts, we can analyse Figure 8, which shows a map of swaps between LAU 1 districts without applying the parameter of the region. At this level, clarity begins to fade as the amount of detail grows. However, we can observe from the map that the largest swaps occur between the centre of the country and its eastern parts. In contrast, western districts have darker lines, indicating lower swapping rates. The capital is apparent on the map as the place with the highest swapping lines. In the choropleth map in Figure 9, we can see that the largest number of swaps occurred in the biggest city of the Czech Republic, Prague, whereas the second largest city, Brno (district Brno-mesto), is at the same level as the rest of the districts. Interestingly, there were more swaps around Brno (district Brno-venkov) and near Brno (districts Trebic and Zdar nad Sazavou).
In contrast, Figure 10 shows a map of swaps between LAU 1 districts with the parameter of the region. The effect of the capital is now clearly eliminated because, in the Czech Republic, the capital is its own region. The biggest swaps can be observed in the ...
The lowest level of LAU 2 municipalities is exceptionally cluttered. This is the situation in Figure 12, which shows a map of swaps between LAU 2 municipalities without the condition. Although the highest swaps are located in the middle of the republic, there is a clear pattern of swaps between the western and eastern parts of the republic. This would be unacceptable if applied to the final data for dissemination. At the level of municipalities in Figure 13, the largest cities, which have the largest number of swaps, are coloured most intensely.
In contrast, the last map, Figure 14, shows a map of swaps between LAU 2 municipalities with the condition. Here, judging by the colours of the swap lines, the level of swapping is approximately the same in each region. No region appears to be overrepresented or underrepresented. The largest swapping occurs in the areas with the largest populations, so no suspicious movements are indicated. At the level of municipalities in Figure 15, the largest cities, where the most swaps took place, are coloured most intensely. Interestingly, in the Moravian-Silesian region, there are quite a number of municipalities with a high number of swaps.
The last maps use synthetic Austrian EU-SILC data. We present the swapping process at the geographical level of NUTS 2, which corresponds to the federal states. Three risk variables were used: a person's age, economic status, and citizenship. The swap rate was set to 5%, and only households of the same size were swapped. As Figure 16 and Figure 17 show, most people were swapped from Vienna to nearby regions: Upper Austria, Lower Austria and Styria. The fewest persons were swapped in the Vorarlberg and Burgenland regions.

Conclusion
We propose a new approach for the visualization of record swapping. Targeted record swapping is a method recommended by the Centre of Excellence on Statistical Disclosure Control of the European Union for microdata protection for the 2021 Census of Population, Housing and Dwellings. However, in scientific papers, the analysis of microdata perturbed by record swapping is always performed purely on numerical results, that is, only by comparing the original microdata with the new protected microdata.
For the visualization, we have used Choropleth maps and Commuting Flow maps, also known as Origin-Destination maps. These maps serve as an intuitive tool for understanding commuting flows, for example, commuting to work. The idea behind these maps is that swapping people between regions follows the same principle as if these people commuted to that region. If we exchange a certain number of people in order to protect microdata, we can visualize such swaps with maps designed to show the migration of people.
We find this depiction of the swapping of people very beneficial. Thanks to the simple visualization of the Origin-Destination lines, we can track suspicious swap movements and unwanted patterns that may occur during swapping based on a visual evaluation of the map. We consider large-scale swapping of households from one corner of the republic to another to be highly undesirable, and thanks to the proposed visualization, we can quickly trace this undesirable behaviour. These maps have proven to be very useful in communication with statisticians when enforcing the highest level of statistical disclosure control. The outcomes of the record swapping can also be presented to management for a more straightforward decision-making process.
The result of the proposed visualization design is a map of the country on which the microdata protection process is plotted. This will make it simpler for employees of national statistical offices to visually evaluate the protection process, enhancing the security of disseminated data and increasing its availability for further research. Visualization of record swapping will provide both an explanation of perturbed microdata and quality control of this process.

Acknowledgment
This paper has been prepared with the support of the Internal Grant Agency of the Prague University of Economics and Business, project No. F4/50/2021.
Figure 1 shows an illustrative example by the Census Division of the National Records of Scotland, NRS (2013), from the PAMS Conference. Figure 2 shows an illustration of targeted record swapping from Dove, Ntoumos, and Spicer (2018), where it was used to describe the method used by the United Kingdom Office for National Statistics in 2011. Figure 3 shows an illustrative example by the Northern Ireland Statistics and Research Agency, NISRA (2021), from their Census Statistical Disclosure Control Methodology.

Figure 4: A map of swaps between NUTS 3 regions without the parameter of the region

Figure 6: A map of swaps between NUTS 3 regions with the parameter of the region

Figure 10: A map of swaps between LAU 1 districts with the parameter of the region

Figure 12: A map of swaps between LAU 2 municipalities without the parameter of the region

Figure 14: A map of swaps between LAU 2 municipalities with the parameter of the region

Figure 16: A map of swaps between NUTS 2 regions in Austria

Table 1: Example of O-D data