Improving Road Freight Transport Statistics by Using a Distance Matrix

Distances driven by road freight vehicles are an essential parameter for the calculation of transport volume. In the Austrian road freight survey, places of loading and unloading are recorded on a postal code basis. To derive the actual distances driven from this data, Statistics Austria uses a distance matrix that was first created in the 1980s. While the first version of this matrix was based on manual measurements, it has recently been recreated and updated using modern routing software. This article describes the methodology on which the current Austrian distance matrix is based. The main points discussed are: how to determine representative centroids for postal code areas; how to deal with journeys within one postal code area; and how to calculate the actual distances using routing software. The last part of the article compares the distance matrix to odometer readings from the Austrian road freight survey of the reference year 2015. This comparison showed a high positive correlation which indicates the good quality of the developed distance matrix and emphasises its usefulness in road freight transport statistics.


Introduction
In the framework of the European Statistical System (ESS), and in line with principle 9 of the European Statistics Code of Practice (Eurostat 2011a), official European statistics should be produced without excessive burden on respondents. This article presents a method for estimating the distances driven in kilometres based on a distance matrix. The method could easily be implemented in the European road freight survey to simplify the collection of kilometres driven. The survey is based on EU Regulation No 70/2012 (Council of the European Union 2012) and is obligatory for all Member States (MS) except Malta.
The first part of this article describes the theory and the base data for the distance matrix. Additionally, several practical examples are introduced. The second part compares the kilometres obtained from the distance matrix with the kilometres driven according to the odometer information for the reference period.

The road freight survey
In general, transport statistics provide information on the transport volume and the transport performance of the different modes of transport (road, rail, inland waterways, sea, air and pipelines). Transport volume is the weight of transported goods in tonnes; transport performance is the product of transport volume and the distance in kilometres.
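The relationship between the two measures can be illustrated with a minimal sketch. The journey values and the function name are invented for illustration and are not survey data.

```python
def transport_performance(tonnes: float, distance_km: float) -> float:
    """Tonne-kilometres of one journey: weight carried times distance driven."""
    return tonnes * distance_km

# Hypothetical journeys as (tonnes, kilometres); values are made up.
journeys = [(12.0, 150.0), (8.5, 40.0)]

total_volume_t = sum(t for t, _ in journeys)
total_performance_tkm = sum(transport_performance(t, d) for t, d in journeys)
```

The sketch shows why the distance matters: an error in the kilometres of a journey carries over directly into its tonne-kilometres.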
In contrast to other surveys in transport statistics, the road freight survey applies the nationality principle instead of the territoriality principle. Furthermore, it is performed as a sample survey rather than a complete survey.

The nationality principle
Compared to the territoriality principle, where all movements of a vehicle on a defined territory are observed, the nationality principle is based on collecting data on vehicles registered in the respective country. Hence, on the basis of EU Regulation No 70/2012 (Council of the European Union 2012), each member state surveys the journeys of road transport vehicles (with a load capacity of at least 3.5 tonnes, or a maximum permissible weight of 6 tonnes in the case of single motor vehicles) performed on public roads within the territory of the member state as well as abroad. Agricultural vehicles, military vehicles and vehicles belonging to central or public administration are not included in the survey.
In the Austrian road freight survey, information on all journeys of lorries registered in Austria is collected. Due to the nationality principle there are five types of transport:
• Domestic transport: Place of loading and place of unloading are both located in Austria. This definition includes cabotage as a special case of domestic transport, as the main focus in this article lies on the territory where the journeys take place and not on the nationality of the vehicles (see figure 1a).
• International dispatch: Place of loading is in Austria and place of unloading in a different country (see figure 1b).
• International receipt: Place of loading is in a different country and place of unloading is in Austria (see figure 1c).
• Transit: Place of loading and place of unloading are not in Austria, but the journey leads through Austrian territory (see figure 1d).
• Other transport abroad: This kind of transport involves journeys of Austrian road goods vehicles which do not take place on Austrian territory (see figure 1e).
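The five categories can be expressed as a small decision rule. The following is a minimal sketch, assuming that each journey carries ISO-style country codes for loading and unloading and a flag indicating whether the route crosses Austrian territory. The function and parameter names are hypothetical.

```python
def transport_type(loading_country: str, unloading_country: str,
                   crosses_home: bool, home: str = "AT") -> str:
    """Classify a journey of a vehicle registered in `home`."""
    if loading_country == home and unloading_country == home:
        return "domestic transport"
    if loading_country == home:
        return "international dispatch"
    if unloading_country == home:
        return "international receipt"
    # Neither end of the journey is in the home country.
    return "transit" if crosses_home else "other transport abroad"

example = transport_type("DE", "IT", crosses_home=True)
```

A journey from Germany to Italy via Austria is therefore classified as transit, while the same journey routed around Austria would count as other transport abroad.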
As a consequence of the nationality principle, the road freight surveys in the member states do not include all transportation on the national territory. Instead, they contain information on the transport of all vehicles registered in each member state, irrespective of where it was performed.
Eurostat receives data sets from all member states and, after several plausibility checks, consolidates them into one comprehensive data set. Based on this comprehensive data set several tables can be generated, which, in accordance with European Commission Regulation No 6/2003 (European Commission 2003a), are distributed to the national authorities responsible for community transport statistics in the individual member states. These authorities can use the information provided to complete the statistical coverage of road transport at national level.
The road freight survey is thus a cross-national European statistic. Hence it is extremely important that the quality of the survey is high in each member state and that the concepts used in the different surveys are as similar and as well coordinated as possible.

A sample survey
Due to the number of registered vehicles in the member states and the large number of journeys, the road freight survey is performed as a sample survey. Given the principles of the European Statistical System, which imply the reduction of burden on respondents, cost effectiveness and the development of advanced statistics using modern methods, a complete survey is not deemed feasible.
The population for the sampling procedure consists of the road freight vehicles registered in each member state. The manual "Road Freight Transport Methodology" (Eurostat 2011b) provides several recommendations for the design of the random sample. These recommendations refer to time periods (normally operations during one reference week), sampling strategies (e.g. considering different sizes of vehicles, separate strata for road tractors) and tips to avoid systematic errors (e.g. refusals, response errors, inadequate coverage of the population).
Regarding the sample size, the thresholds for the percentage standard errors are defined in European Commission Regulation No 642/2004 (European Commission 2003b). The percentage standard errors of the annual estimates for the main variables (tonnes transported, tonne-kilometres performed and total kilometres travelled loaded) shall not exceed ± 5 % (95 % confidence), or ± 7 % if the total stock of road motor vehicles relevant to the survey in a Member State is less than 25 000 or the total stock of vehicles engaged in international transport is less than 3 000.
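The threshold rule can be written down as a short sketch. The function name is hypothetical, and the vehicle counts used in the usage lines are invented inputs, not official figures.

```python
def max_percentage_standard_error(total_stock: int, international_stock: int) -> float:
    """Precision threshold in per cent, following the rule quoted above."""
    if total_stock < 25_000 or international_stock < 3_000:
        return 7.0
    return 5.0

# Hypothetical stocks: a large fleet and a small fleet.
large_fleet_limit = max_percentage_standard_error(66_000, 5_000)
small_fleet_limit = max_percentage_standard_error(20_000, 5_000)
```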
In Austria, the population of the survey consists of around 66 000 road freight vehicles with a load capacity of at least 2 tonnes or road tractors. Once a year (usually in December) a stratified sample (strata: load capacity of the local unit, vehicle capacity, region, type of transport) is drawn for the whole reporting year. As a benefit of this yearly procedure, large companies are informed in advance about the dates of their reference weeks. To avoid a possible bias due to inactive local units or vehicles deregistered during the year, a refreshment sample is drawn quarterly. In total, 26 000 reporting weeks are collected annually.

The questionnaire
Due to the high complexity of reported journeys (e.g. combination of laden or empty journeys, delivery or collection journeys) the design of the questionnaire for the road freight survey is a huge challenge for statisticians. As a result, the questionnaire resembles a log book more than a questionnaire typically used in official statistics.
Four main requirements have to be taken into account when developing a questionnaire that collects all relevant data (e.g. place of loading and unloading, type of goods or distances driven):
• The questionnaire should be easy to understand and fill in.
• The respondents' burden should be minimised.
• The collected information should be detailed and accurate.
• Several kinds of questionnaires (paper-based, computer-based) should be offered.
Regarding the last point, it has to be mentioned that the target group for the questionnaire is very heterogeneous. In large companies the questionnaires are usually filled in by the staff of the accounting departments, whereas in smaller companies it is mostly the driver who is in charge. A study conducted by SYSTRA for the Department for Transport in the UK (Systra 2015) showed that the information sources used to complete the road freight survey vary considerably. Companies use, for instance, run records, drivers' reports/day sheets/worksheets, tachograph software, GTS or GPS systems, Google Maps for distance calculation, fuel cards, vehicle inspection sheets, odometer readings, company diaries, log books or smartphone applications. One result of the study was that companies typically need three different sources to complete the survey. In some companies all data was stored electronically, while in others there was a mixture of computer-based information and hard-copy data sources. Therefore, it is useful to design the questionnaire so that it can easily be used on different kinds of media (e.g. electronic questionnaire, Excel sheets, mobile phone applications or paper questionnaire), enabling each respondent to choose the appropriate kind of questionnaire.
Nevertheless, the questionnaire is complex and dynamic because its length depends on the number and type of journeys during the reference week. Hence, the effort for every respondent might be different. To support the Member States in the development of the questionnaire, Eurostat provides several suggestions through the reference manual.

Collecting information about the distances of journeys
The distance of each journey is one of the key variables of the survey, as it is required for the calculation of the transport performance. According to the Eurostat manual for the road freight survey, the respondents should provide this information for each journey. In practice this information is frequently not available. The SYSTRA study showed that in such cases respondents use the driver's worksheet for the variables place of loading and unloading. Additionally, Google Maps or similar in-house systems are used to calculate the kilometres.
Obviously, collecting data about place of loading, place of unloading and additionally, the kilometres driven between these places raises the burden on respondents and is redundant.
Statistics Austria was aware of these difficulties as early as the 1970s. For this reason a method was developed which imputes the kilometres driven between the place of loading and the place of unloading on the basis of the postal codes of these places. The basis of the imputation is a distance matrix which contains the distances in kilometres between every possible postal code combination within Austria and, with limitations, abroad.
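The imputation itself amounts to a lookup by postal code pair. The following is a minimal sketch, assuming the matrix is held as a dictionary keyed by ordered (loading, unloading) pairs. The postal code pairs, kilometre values and function names are invented for illustration and do not come from the actual matrix.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical excerpt of a distance matrix; kilometre values are invented.
# Ordered pairs are stored because a route and its return need not be equal.
DISTANCE_KM: Dict[Tuple[str, str], float] = {
    ("1010", "4020"): 185.0,
    ("4020", "1010"): 186.0,
    ("1010", "8010"): 200.0,
}

def impute_distance(loading: str, unloading: str,
                    matrix: Dict[Tuple[str, str], float]) -> Optional[float]:
    """Return the stored distance, or None if the pair is not in the matrix."""
    return matrix.get((loading, unloading))

def weekly_kilometres(journeys: List[Tuple[str, str]],
                      matrix: Dict[Tuple[str, str], float]) -> float:
    """Sum the imputed kilometres over the journeys of one reference week."""
    distances = [impute_distance(a, b, matrix) for a, b in journeys]
    return sum(d for d in distances if d is not None)

week = [("1010", "4020"), ("4020", "1010"), ("1010", "8010")]
total_km = weekly_kilometres(week, DISTANCE_KM)
```

With such a lookup the respondent only has to report the two postal codes; pairs missing from the matrix are returned as None so that they can be handled separately.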
The first version of a distance matrix was developed by using meilographs to measure the lengths of the roads between two postal codes manually, or by using algorithms based on air-line distance. The development of the distance matrix was therefore complex and labour-intensive, and it was impossible to update it continuously.
Due to the development of modern IT technology, powerful route planning software and GIS applications in general, the automatic generation of distance matrices is nowadays a straightforward process. The following part of the article describes methods to improve the road freight transport statistics by using a distance matrix.
Methods for calculating a distance matrix

General considerations
As vehicles registered in Austria could operate anywhere in Europe as well as outside, a Europe-wide matrix would in principle be necessary. However, given the computing time required to calculate all combinations of European postal codes and the fact that postal codes may change over time, creating and continuously maintaining such a matrix at postal code level would require far too much effort. Furthermore, updates of the distance matrix should be possible at regular intervals (e.g. every five years) without major changes to the underlying internal processes.
It is advisable to analyse the most frequently used places of loading and unloading of previous journeys. In Austria more than 90 % of the journeys of vehicles registered in Austria are performed within the national territory. Based on this information it became necessary to subdivide the methods used into journeys on national territory on the one hand and journeys abroad on the other hand. Regarding the high percentage of journeys performed on national territory it was primarily important to develop a particularly accurate matrix referring to these distances.
For the distances abroad a more general and, above all, more practicable approach was needed. It was therefore advisable to find a way to aggregate the postal codes abroad. One possibility would be to use NUTS 3 regions, as Eurostat already offers correspondences between NUTS 3 regions and postal codes in the TERCET database. Using these existing correspondences allows the creation of a distance matrix for all NUTS 3 regions within the European Union without considerable development effort.
For Austria it was more effective to keep the historical approach of using so-called postal code regions. These postal code regions were defined in the 1980s based on regional subdivisions, each of which groups a number of neighbouring postal codes. They are smaller than the NUTS 3 regions and hence provide more precise distances. More than 80 % of the journeys abroad by vehicles registered in Austria are accounted for by Germany, Italy and Switzerland. Taking this into account, a method was required to calculate the distances for these countries, along with a different approach for countries with fewer journeys.
After finding the appropriate regions (NUTS 3 or any other defined region) as a basis for the distance matrix, the next step is to decide on a centroid (geo-coordinate) representing each region. Then the distances between all combinations of these centroids (geo-coordinates) can be calculated as an origin-destination matrix using appropriate GIS software (including routing options).
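The structure of such an origin-destination matrix can be sketched without routing software. The example below uses the great-circle (air-line) distance as a stand-in for the routed distance, which is a deliberate simplification and not the method used for the Austrian matrix. The region labels and coordinates are approximate and chosen only for illustration.

```python
import math
from itertools import permutations

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

# Hypothetical centroids (approximate city coordinates), one per region.
centroids = {
    "Wien": (48.21, 16.37),
    "Linz": (48.31, 14.29),
    "Graz": (47.07, 15.44),
}

# One entry per ordered origin-destination pair.
od_matrix = {
    (origin, destination): haversine_km(centroids[origin], centroids[destination])
    for origin, destination in permutations(centroids, 2)
}
```

In a real application the call to haversine_km would be replaced by a query to the routing engine, while the loop over all ordered pairs of centroids stays the same.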
The following part of the article describes the development of a distance matrix for Austria. The description should serve as a guide for other countries which are interested in developing a similar system.

Finding a representative geo-coordinate
To calculate the distances between postal code areas, a specific geo-coordinate that could be used as centroid for routing tasks had to be determined for each postal code. Initially a purely population weighted centroid based on the population numbers from the Austrian population census of 2011 and the coordinates from the register of buildings and dwellings (AGWR) was chosen. The AGWR contains address details of parcels, buildings and dwellings (including x,y-coordinates) as well as structural data for buildings, dwellings and other usage units. It is linked with the Austrian population register, and thus contains the number of people living at any given address.
Nevertheless, this method had some weaknesses: it only considered the residential population, so industrial and commercial areas were severely underrepresented. Therefore, a new method based on both the residential and the "daytime population" was implemented.
For the daytime population, individuals are not counted at their place of residence but rather where they are likely to be during the day, e.g. at their workplace or school. Figure 2 and figure 3 illustrate the large differences between daytime and residential population. Figure 4 depicts centroids derived from these population measures.
In this new method for determining the central points a combination of daytime and residential population was used. It can be described as follows:
• Determine the weighted centroid of a postal code area based on the sum of the residential population and the daytime population.
• Move this point to the closest building with a residential or daytime population >0 that lies inside an area of permanent settlement.
• Move this point again to the closest street or crossing, considering the rank of the street.
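The first two steps can be sketched as follows, assuming that each building is available with planar x,y-coordinates, its residential and daytime population, and a flag for the area of permanent settlement. The class and function names and all values are hypothetical. The third step (moving the point to the closest street, considering the rank of the street) needs a road network and is left out of this sketch.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Building:
    x: float
    y: float
    residents: int
    daytime: int
    in_settlement_area: bool = True

    @property
    def weight(self) -> int:
        # Sum of residential and daytime population.
        return self.residents + self.daytime

def weighted_centroid(buildings: List[Building]) -> Tuple[float, float]:
    """Step 1: centroid weighted by residential plus daytime population."""
    total = sum(b.weight for b in buildings)
    if total == 0:
        raise ValueError("no population in this postal code area")
    return (sum(b.x * b.weight for b in buildings) / total,
            sum(b.y * b.weight for b in buildings) / total)

def snap_to_building(point: Tuple[float, float],
                     buildings: List[Building]) -> Building:
    """Step 2: closest populated building inside the settlement area."""
    candidates = [b for b in buildings if b.in_settlement_area and b.weight > 0]
    return min(candidates, key=lambda b: math.hypot(b.x - point[0], b.y - point[1]))

# Hypothetical buildings: a residential block, a workplace and an empty shed.
area = [
    Building(0.0, 0.0, residents=10, daytime=0),
    Building(10.0, 0.0, residents=0, daytime=30),
    Building(7.0, 0.0, residents=0, daytime=0),
]
centroid = weighted_centroid(area)
anchor = snap_to_building(centroid, area)
```

In this toy area the centroid is pulled towards the workplace, and the empty shed is skipped in the second step although it lies closest to the centroid.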

Distances between postal code areas
The calculation of the actual distances between the central points was based on the TomTom routing network and was implemented in ESRI ArcGIS 10.1 with the Network Analyst extension. The routing system allows an accurate distance determination based on several features:
• The subdivision into road sections, each covering the distance from one crossing to the next, with the maximum speed stored for each section.
• Information about restricted road access, one-way streets, toll roads and the higher-level road network.
• Information on whether a particular road section is located in a built-up area.
Based on the maximum speed and the information on road sections lying in built-up areas or major cities, Statistics Austria developed a speed model as the basis for the calculations.
The route chosen was the fastest route between two postal code centroids, based on the STAT speed model (Kaminger and Vojtech 2016). Certainly, the fastest connection is not necessarily the shortest one, but experience has shown that mostly the high-level road network is used.
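The difference between the fastest and the shortest route can be shown on a toy network. The sketch below runs Dijkstra's algorithm with travel time as the cost and reports the kilometres along the resulting route. The network, the speeds and the function name are invented for illustration; this is not the STAT speed model or the ArcGIS implementation.

```python
import heapq

# Hypothetical network: node -> list of (neighbour, length_km, speed_kmh).
# A-B is a direct local road; A-M-B is a longer route on a faster road.
GRAPH = {
    "A": [("B", 60.0, 50.0), ("M", 40.0, 100.0)],
    "M": [("B", 40.0, 100.0)],
    "B": [],
}

def fastest_route(graph, origin, destination):
    """Dijkstra on travel time; returns (hours, km) of the fastest route."""
    queue = [(0.0, 0.0, origin)]
    settled = set()
    while queue:
        hours, km, node = heapq.heappop(queue)
        if node == destination:
            return hours, km
        if node in settled:
            continue
        settled.add(node)
        for neighbour, length_km, speed_kmh in graph.get(node, []):
            heapq.heappush(
                queue, (hours + length_km / speed_kmh, km + length_km, neighbour))
    raise ValueError("no route found")

hours, km = fastest_route(GRAPH, "A", "B")
```

Here the fastest route is 80 km long although a 60 km connection exists, which mirrors the observation that the high-level road network is mostly used.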

Distances within one postal code area
Journeys within one postal code area have both the place of loading and the place of unloading within the same postal code area. Therefore these journeys should be treated separately. They are often "delivery or collection journeys", e.g. grade supplies for retail stores, beverage deliveries or waste collections.
As the method described above could not be applied to these special cases, a different approach based on the geographic extent of a postal code area was developed. Initially, only the centre points of postal code areas were available at Statistics Austria. Information about the geographic borders was not at hand. These points were used to generate a partitioning of the Austrian national territory based on a Voronoi diagram. The calculation of the transport distance (kmDis) was then based on the diagonal of the minimum bounding rectangle (bounding box) of the respective polygon associated with each postal code.
A straightforward approach would be to define the requested distance as half the diagonal of the bounding box and use it for the calculation of the kilometrage:

$$\mathrm{km}_{\mathrm{Dis}} = \frac{\mathrm{km}_{\mathrm{Diagonal}}}{2}$$

Regarding the landscape of Austria, it is clear that this approach does not fit every area, as there are many alpine regions and woods to take into account. Consequently, a refined approach was required. Thus, the share of the settlement area (the area available for agriculture, settlement and industry) was also considered. A first analysis showed that using the share of the settlement area directly as a factor resulted in distances that were too low for areas with a very low share of settlement area. Based on this experience it was decided to set the factor to at least 25 %:

$$\mathrm{km}_{\mathrm{Dis}} = \frac{\mathrm{km}_{\mathrm{Diagonal}}}{2} \times \max(25\,\%,\ \text{share of settlement area in per cent})$$

To illustrate the approach, two examples are presented in the following:

Vienna -Down Town
The bounding box for Vienna's central district has an area of 2.89 square kilometres and a 100 % share of settlement area. Half the diagonal of the bounding box is 1.5 kilometres. Therefore, the resulting distance for Vienna -Down Town is 1.5 kilometres.

Sölden
A totally different example is an alpine region like Sölden im Ötztal. The area of the bounding box is 160.7 square kilometres with a settlement area share of 3.47 %. Half the diagonal of the bounding box is 13.3 kilometres, which is longer than the major road within Sölden and therefore an unrealistically high value. As the share of the settlement area is lower than 25 %, the weighting factor is 0.25, resulting in a calculated distance of 3.32 kilometres. This value seems plausible given that the total length of the only major road in this postal code is about 7 kilometres.
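Both examples can be reproduced with a few lines of code. This is a minimal sketch of the half-diagonal rule with the 25 % floor; the function names are hypothetical and the inputs are the rounded figures quoted in the two examples, so the Sölden result differs from the published value in the last digit.

```python
import math

def half_diagonal_km(points):
    """Half the diagonal of the bounding box of polygon vertices given in km."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return math.hypot(max(xs) - min(xs), max(ys) - min(ys)) / 2

def intra_postal_distance_km(half_diagonal, settlement_share_pct, floor_pct=25.0):
    """Half diagonal weighted by the settlement share, with a lower bound."""
    factor = max(floor_pct, settlement_share_pct) / 100.0
    return half_diagonal * factor

vienna_km = intra_postal_distance_km(1.5, 100.0)   # full settlement share
soelden_km = intra_postal_distance_km(13.3, 3.47)  # floor of 25 % applies
```

The helper half_diagonal_km shows how the input would be obtained from a polygon: only the extreme x and y values of its vertices are needed.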

Calculation of the matrix abroad
As with national data, a central point (place of loading/unloading) had to be defined for each region. Since the data necessary to calculate population-weighted centroids was not available for all of Europe, a different method had to be developed. If more than 10 trips to/from a postal code region were available, the weighted centre point was defined as the geographic mean of those origins/destinations. Usually that was the geographic centre of the postal code region with the most journeys. As mentioned before, more than 80 % of all journeys abroad performed by vehicles registered in Austria in recent years involved Germany, Italy or Switzerland. To make the results as robust as possible, all journeys of the last eleven years concerning Germany and Italy were considered, based on the postal code combinations.
For Switzerland, or if fewer than 10 trips to/from a postal code region in Germany or Italy were available, the central point was defined manually based on local geographic and urban features such as industrial areas or important ports.
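The decision rule for regions abroad can be sketched as follows. Two assumptions are made here that the text does not spell out: the "geographic mean" is taken as the arithmetic mean of the trip coordinates, and regions with exactly 10 trips fall back to the manual point. The function name and all coordinates are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees

def region_centre(trip_points: List[Point], manual_centre: Point,
                  min_trips: int = 10) -> Point:
    """Mean of observed origins/destinations if there are enough trips,
    otherwise the manually defined centre point."""
    if len(trip_points) > min_trips:
        n = len(trip_points)
        return (sum(p[0] for p in trip_points) / n,
                sum(p[1] for p in trip_points) / n)
    return manual_centre

# Hypothetical region with 12 recorded origins/destinations.
busy_region = [(48.0, 11.0), (49.0, 12.0)] * 6
busy_centre = region_centre(busy_region, manual_centre=(48.2, 11.6))

# Hypothetical region with only 3 trips: the manual point is used.
quiet_centre = region_centre([(46.0, 9.0)] * 3, manual_centre=(46.2, 9.1))
```

Averaging latitude and longitude directly is adequate for regions of this size; for very large regions a proper spherical mean would be preferable.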
For other countries, it was not deemed necessary to pre-calculate any distances. Those are calculated on a case-by-case basis and inserted into the matrix as required.
Alternatively, Karner, Scharl, and Weninger (2014) describe a methodology for determining central points of NUTS 3 regions based on the Urban Clusters (European Commission, 2006) and CORINE land cover (European Environment Agency) datasets. This method can easily be implemented by anyone, as all the necessary data is available free of charge from the respective agencies.
Once a coordinate has been defined for each postal code region or NUTS 3 region, the distances between the regions can be calculated either in a dedicated GIS database or using external routing services such as Google Maps or OpenStreetMap. For the distance matrix outside of Austria, the commercial routing software Microsoft MapPoint 2011 was used. This was necessary as the routing network used for calculating Austrian domestic transport distances was only available for Austria.
Even on this aggregated level, calculating all possible routes would have been too inefficient. As the methodology presented in this paper is flexible, it is easy to update an existing matrix of pre-calculated distances on demand, if new origin-destination combinations are required.
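Updating the matrix on demand can be pictured as a cache in front of the routing step. The following minimal sketch assumes a routing function that takes two region identifiers and returns kilometres; the class name, the stand-in routing function and its values are hypothetical.

```python
from typing import Callable, Dict, Tuple

class OnDemandMatrix:
    """Distance matrix that calculates a pair only when it is first requested."""

    def __init__(self, route_fn: Callable[[str, str], float]) -> None:
        self._route_fn = route_fn
        self._cache: Dict[Tuple[str, str], float] = {}

    def distance(self, origin: str, destination: str) -> float:
        key = (origin, destination)
        if key not in self._cache:
            # New origin-destination combination: route once, then store.
            self._cache[key] = self._route_fn(origin, destination)
        return self._cache[key]

    def __len__(self) -> int:
        return len(self._cache)

calls = []

def fake_router(origin: str, destination: str) -> float:
    """Stand-in for a routing service; returns an invented distance."""
    calls.append((origin, destination))
    return 640.0

matrix = OnDemandMatrix(fake_router)
first = matrix.distance("DE-80", "IT-20")
second = matrix.distance("DE-80", "IT-20")  # served from the stored matrix
```

Only combinations that actually occur in the survey data are ever routed, which keeps the matrix small compared with a full pre-calculation.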

Odometer information as benchmark for the road freight survey
To verify the developed distance matrix as well as the quality of the survey, the distances from the distance matrix were compared with the odometer information received from the questionnaire. This was done with data of the Austrian road freight survey from the reference year 2015.
As previously mentioned, the respondent has to fill in the place of loading and the place of unloading for each journey during the reference week, which are then used to obtain the kilometres driven from the distance matrix. Additionally, the odometer readings at the beginning and at the end of the week have to be provided. The difference between these two readings represents the kilometres driven during the reference week.
For a comparison of these two data sources it has to be considered that not every journey is reportable. Journeys on private roads (such as forest roads, roads within a factory, hospital grounds or construction sites) are excluded from the survey, as are winter services (snow removal, gritting) and road maintenance. Therefore, the reported journeys are only a subset of all journeys driven during the reference week.
Another important point is that some odometer readings might be incorrect. As highlighted before and also mentioned in the SYSTRA study, the questionnaire is often filled in by departments that are not directly involved (e.g. accounting) instead of the actual driver. As they might fill in the questionnaire some time after the vehicle was driven, the information provided (such as the actual odometer reading at the start or the end of the week) might be inaccurate or incomplete.
The analysis is based on 19 583 reported reference weeks in 2015. Out of these, 4 617 weeks were eliminated because the received difference of the odometer readings was zero. Assuming that a driven distance of more than 3 000 km per week might be too high and consequently incorrect, 1 372 cases were also excluded. After these plausibility checks 13 594 weeks were used for further analysis.
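The two plausibility rules can be written as a simple filter. This is a minimal sketch with invented odometer readings; the function name is hypothetical, and the treatment of negative differences is not covered by the rules above and is therefore left open here.

```python
def is_plausible(odo_start: int, odo_end: int, max_km: int = 3000) -> bool:
    """Keep a reference week unless the odometer difference is zero
    or exceeds the weekly maximum."""
    driven = odo_end - odo_start
    return driven != 0 and driven <= max_km

# Hypothetical (start, end) readings for four reference weeks.
weeks = [
    (120_000, 120_000),  # zero difference: dropped
    (50_000, 51_200),    # 1 200 km: kept
    (10_000, 14_500),    # 4 500 km: dropped as too high
    (80_000, 80_450),    # 450 km: kept
]
kept = [week for week in weeks if is_plausible(*week)]

# Bookkeeping for the 2015 figures quoted above.
remaining_2015 = 19_583 - 4_617 - 1_372
```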
The results of this comparison are recorded in table 1. This table contains aggregate statistics for 2015 such as total annual kilometres and annual kilometres grouped by NACE (Council of the European Union 2006) and transport type. It can be seen that the differences between the reported kilometres and the kilometres estimated from the distance matrix are very small. As assumed, the estimated kilometres are slightly lower than the reported kilometres. This indicates that, at this level of detail, the approach described in this paper works quite well. Despite the high correlation, outliers were detected. Generally, the comparison of the accumulated kilometres from the distance matrix with the odometer reading is used as a plausibility check for the data of the road freight survey in Austria. If the two values differ by more than 30 %, employees of the statistical office contact the respondents to clarify the discrepancies. As the main focus of this analysis was the improvement of the survey with regard to underestimation, vehicles whose kilometres from the distance matrix were higher than the odometer reading were accepted and no further enquiry was made. For outliers in the opposite direction, where these kilometres were lower than the odometer readings, the respondents were called to identify the reasons for the underestimation. The general feedback was that the odometer reading had been incorrect or that there had been no reportable journeys during the reference week. This indicates a good quality of the developed distance matrix and supports the use of a distance matrix to reduce respondents' burden. Furthermore, as this comparison offers an additional plausibility check, it improves the data quality of the survey.
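The one-sided follow-up rule used in this analysis can be sketched as follows. The text does not state which value the 30 % refers to, so the sketch assumes that the deviation is measured relative to the odometer reading. The function name and the kilometre values are hypothetical.

```python
def needs_follow_up(matrix_km: float, odometer_km: float,
                    tolerance: float = 0.30) -> bool:
    """Flag a reference week only if the matrix kilometres fall short of the
    odometer kilometres by more than the tolerance (assumed to be relative
    to the odometer value)."""
    if odometer_km <= 0:
        return False  # such weeks are removed by earlier plausibility checks
    deviation = (matrix_km - odometer_km) / odometer_km
    return deviation < -tolerance

underestimated = needs_follow_up(600.0, 1000.0)   # 40 % below the odometer
within_range = needs_follow_up(800.0, 1000.0)     # 20 % below the odometer
overestimated = needs_follow_up(1500.0, 1000.0)   # above: accepted as is
```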
Another analysis classified the journeys into "Hire or reward" (NACE 49.4 Freight transport by road and removal services) and "Own account" (other NACE positions). Both showed a correlation coefficient of 0.9, which was only marginally lower than in the whole sample (see figure 6). As illustrated, vehicles of "Hire or reward" drive longer distances per week than those belonging to the classification "Own account".
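For readers who wish to repeat such a comparison, the correlation coefficient can be computed as shown below. The sketch assumes that the coefficient reported is Pearson's r, which the text does not state explicitly, and the weekly kilometre values are invented.

```python
import math
from typing import Sequence

def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical weekly kilometres: distance matrix versus odometer.
matrix_km = [100.0, 200.0, 300.0, 400.0]
odometer_km = [120.0, 230.0, 310.0, 450.0]
r = pearson_r(matrix_km, odometer_km)
```

A coefficient close to 1 only shows that the two series move together; a systematic underestimation, as discussed above, has to be checked separately by comparing totals.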
Furthermore, the reference weeks of the companies were analysed by the NACE activities "Manufacturing" (C10-C32), "Waste collection" (E38.1) and "Construction" (F41-F43). It can be seen that both "Manufacturing" and "Construction" have a higher correlation coefficient than "Waste collection" (see figure 7). On the one hand, this is because journeys for waste collection rarely follow a direct route: the trucks often drive around several streets within one area, so the calculated kilometres are below the actual weekly kilometres according to the odometer. On the other hand, journeys for collecting waste are short, and therefore the problem with journeys within one postal code mentioned above contributes to the discrepancy. For the activities "Manufacturing" and "Construction" it has to be kept in mind that journeys on the construction site or the factory site do not have to be reported. As a consequence, the calculated kilometres naturally have to be lower than those from the odometer information.
Regarding the analysis of kilometres within one postal code area, it was investigated how the new approach described above would affect the discrepancy between the two variables (see figure 8). As no data were available for the whole reference year, the comparison was limited to the first quarter of 2015. The graph on the left side shows the old approach of the distance matrix (1 km was assumed for every journey within one postal code), whereas the graph on the right side uses the approximated distance based on the size of the postal code area. Both graphs use only journeys within Austria. There was no increase in the correlation coefficient, which remained at 0.86. Nevertheless, when comparing the total transport distances measured by odometer (1.10 million km) with the distances estimated with the old method (0.92 million km) and the new method (0.94 million km), it becomes clear that the new method produces slightly better results.
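As a rough check on these totals, the share of the odometer kilometres captured by each method can be worked out from the rounded figures quoted above (old method first, new method second):

```latex
$$\frac{0.92}{1.10} \approx 0.84, \qquad \frac{0.94}{1.10} \approx 0.85$$
```

The new method thus covers roughly 85 % of the odometer kilometres instead of roughly 84 %, a small improvement that is consistent with the unchanged correlation coefficient.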

Conclusion
The aim of this article was to point out the advantages of a distance matrix and to present general guidance on creating a distance matrix for a country in order to facilitate the survey for road freight transport statistics. The distance matrix is a reasonable instrument to decrease the burden on respondents.
In order to eliminate the obligation to calculate the kilometres driven or to record all odometer readings for the different journeys the distance can be calculated automatically by the statistical office through the place of loading and the place of unloading. It is indispensable to renew and update the distance matrix regularly as infrastructure and population focus change over time.
Together with the odometer information for a specific period, it can also be used as an additional plausibility check to increase the quality of the data, although it is sometimes difficult to compare the sum of all calculated kilometres with the odometer information of the reference week because some journeys are not reportable.
The comparison of the odometer information with the distances estimated with the distance matrix showed a high positive correlation (r = 0.91) for the Austrian data of the reference year 2015. This indicates a good quality of the data and, combined with the achieved reduction of respondents' burden, it also strengthens the case for using a distance matrix in road freight transport statistics.
Certainly, there is still some work to be done on the improvement of the presented distance matrix. As seen in the last chapter, distances for journeys within one postal code or for specific kinds of transport (e.g. delivery and collection journeys such as waste collection) have to be analysed further, as there are still discrepancies between the kilometres calculated from the distance matrix and the kilometres actually driven.
For the future there are several possible approaches:
• The estimation of weights for journeys within one postal code region, grouped by parameters such as the length of the high-level road network or the extent of industrial or residential areas.
• Special questionnaires adapted for delivery or collection journeys (e.g. no type of goods) with additional information on kilometres from the respondents.
• The use of mobile apps as a new kind of questionnaire for the road freight transport survey. The main benefit would be the monitoring of accurate kilometres driven based on GPS-technology.