Investigating the Dark Figure of COVID-19 Cases in Austria

The number of undetected cases of SARS-CoV-2 infections is expected to be a multiple of the reported figures, mainly due to the assumed high proportion of asymptomatic infections and to the limited availability of trustworthy testing resources. Relying on the deCODE genetics study in Iceland, which offers large scale testing among the general population, we investigate the magnitude and uncertainty of the number of undetected COVID-19 cases in Austria. We formulate several scenarios relying on data on the number of COVID-19 cases which have been hospitalized, in intensive care, as well as on the number of deaths and positive tests in Iceland and Austria. We employ frequentist and Bayesian methods for estimating the dark figure in Austria based on the hypothesized scenarios and for accounting for the uncertainty surrounding this figure. Using data available on April 01, 2020, our study yields two main findings. First, we find the estimated number of infections to be on average around 8.35 times higher than the recorded number of infections. Second, the width of the uncertainty bounds associated with this figure depends highly on the statistical method employed. At a 95% level, lower bounds range from 3.96 to 6.83 and upper bounds range from 9.82 to 12.61. Overall, our findings confirm the need for systematic tests in the general population of Austria.


Introduction
The number of confirmed infections with SARS-CoV-2 is a central figure on which many studies and models rely for further analysis and for evaluating whether the social distancing measures have proven effective. The same figure is reported by media outlets and is the main information directed to the general public.
However, reports and scientific studies from different countries estimate the true number of infections to be a multiple of the reported figures (Li, Pei, Chen, Song, Zhang, Yang, and Shaman 2020; Maugeri, Barchitta, Battiato, and Agodi 2020; Zhao, Musa, Lin, Ran, Yang, Wang, Lou, Yang, Gao, He et al. 2020). The number of undetected cases, the so-called dark figure, is expected to be large mainly due to two reasons: i) the characteristics of the disease and ii) the testing strategy (Czypionka and Reiss March 19, 2020). A dangerous characteristic of the COVID-19 disease is that the population is highly susceptible due to lack of antibodies and that transmission is likely to be carried out by infected individuals who exhibit light to no symptoms. Several studies report that roughly 50% of the infections are asymptomatic, which makes the early detection and isolation of the infected problematic (see, e.g., Nishiura, Kobayashi, Miyama, Suzuki, Jung, Hayashi, Kinoshita, Yang, Yuan, Akhmetzhanov et al. 2020; Day 2020; Shahan March 21, 2020; Castelfranco March 16, 2020). This means that even if all the symptomatic infections were recorded, the dark figure would be at least as high as the number of confirmed infections. Moreover, infected individuals with mild symptoms are unlikely to come into contact with the health care system and will therefore not be recorded in official statistics. Extensive testing can thus be a viable strategy for accurately estimating the prevalence of the disease in the general population. Nevertheless, countries around the world are struggling to set up such a strategy due to, e.g., costs, testing capacity and availability of testing kits. Moreover, country-wide screenings with PCR (polymerase chain reaction) tests only provide a current snapshot of the active infections on a target date. As people who were infected but do not excrete the virus on this specific date test negative, these screenings do not offer information on the total number of individuals who have been infected with SARS-CoV-2.
Reports estimated the number of infected in Italy to be around 3.5 times higher than reported as of February 29, 2020 (Tuite, Ng, Rees, and Fisman 2020). Slightly lower estimates have been given for Germany (Kekulé March 14, 2020). In Austria, the limited test capacity has been an important set-back in the government's testing strategy. First, the capacity of PCR testing kits has been relatively low, with the test capacity increasing slowly relative to the number of infections (Czepel March 26, 2020). Second, until around March 22, 2020, only people who had contact with confirmed infected people or who came from high risk areas such as Italy or China were tested. These issues point to a substantial number of unrecorded infections in Austria. Previous estimations place the number of infected in Austria between 16 000 and 55 000 as of March 18, 2020 (Czypionka and Reiss March 19, 2020; Sturn March 27, 2020; ORF March 27, 2020). In the four day period between March 31, 2020 and April 03, 2020 a field study commissioned by the Austrian government will be conducted by the research institute SORA, where 2000 individuals will be tested for COVID-19 in Austria. The results of this study will shed light on the number of active infections in Austria by ensuring that the sample of tested individuals is representative of the general population, but it does not provide information on the accumulated number of SARS-CoV-2 infections in Austria (ORF March 30, 2020). For this purpose, reliable and accurate (i.e., sensitive and specific) antibody tests would be necessary.
In the absence of an extensive field study, the attempts to quantify the dark figure of infections can only produce rough estimates as they rely on several assumptions and/or on results from different countries. The difficulty in estimating undetected infections arises mainly due to data quality issues. The available data is highly dependent on the testing and reporting strategies of the different countries. Comparisons are not straightforward as most of the countries apply different testing approaches. Italy, for example, mostly focuses on testing patients with symptoms in hospitals (Onder, Rezza, and Brusaferro 2020). This results in a high proportion of positive tests, and the mortality rates are also apparently higher in the tested sample due to more severe cases being tested. The roughly 50% of infections that are asymptomatic are not covered by this approach, so one can assume that the true mortality rate is lower than reported. Other countries have delayed or performed very limited testing until very recently. This was the case in the US, where fewer than 1000 cases were reported until March 10, 2020. Since then the number has increased by a factor of 196 as of March 31, 2020 (Dong, Du, and Gardner 2020).
While the data quality issue is one most countries are battling with, one can make use of studies in other countries which have a better testing strategy. Iceland can be seen as a pioneer in this respect as it launched large scale testing of the general population (see Government of Iceland March 21, 2020). The bio-pharmaceutical company deCODE genetics launched a testing program in which people are tested on a voluntary basis in order to get a better understanding of the spread of the coronavirus across the country. As of April 01, 2020, Iceland has tested a higher proportion of its citizens than almost any other country in the world (except the Faroe Islands) and through the deCODE genetics study is in a position to obtain a rough approximation of the unrecorded infections in Iceland (Nardelli and Ashton March 21, 2020).
The goal of this article is to obtain rough estimates of the dark figure of infections in Austria by borrowing information from the screening of the general population in Iceland. The approach we propose consists of defining several scenarios in which we hypothesize on the relationship between the observed figures in Iceland and the observed figures in Austria. While this study follows a rather naive approach and disregards most of the complexity of the problem by making simplifying assumptions, it is also, to the best of our knowledge, the only one making use of the Icelandic case study in order to extrapolate results to other countries. It should provide rough approximations of the unrecorded infections in Austria while serving as a starting point for further research.
The remainder of the article is organized as follows: Section 2 introduces the deCODE genetics study in Iceland. Section 3 provides a comparison of Austria and Iceland in terms of COVID-19 figures. The different scenarios and our approach for estimating the dark figure in Austria are presented in Section 4. Section 5 discusses the findings and Section 6 provides concluding remarks.

Study performed by deCODE genetics
From March 15, 2020 until March 31, 2020, 10401 individuals were tested by the medical research company deCODE genetics, who joined forces with the Icelandic authorities in performing a large scale SARS-CoV-2 testing in Iceland. The company offers screening on a voluntary basis among the general population, by offering a free coronavirus test to anyone who fills in an online form. This approach to testing leads to a higher representation in the sample of non-symptomatic non-quarantined individuals and the sample of tested people can be considered to be more representative of the whole Icelandic population than the sample of individuals tested by the authorities, who are mainly symptomatic or considered high risk. Therefore, assuming that the deCODE genetics study covers a representative sample would allow conclusions to be drawn about the prevalence of the disease in the general population. The tests performed by deCODE genetics, as well as the ones performed by the Department of Microbiology of the National University Hospital of Iceland (NUHI), use the technology of polymerase chain reaction (PCR), which is considered to date to be the most accurate for COVID-19 diagnosis.
The number of infections discovered by the deCODE genetics study in the above mentioned time frame lies at 84. This constitutes a prevalence rate in the tested sample of 0.81%. Assuming that the sample of deCODE genetics is representative for the whole Icelandic population, 0.81% can be used as an estimate for the prevalence of the COVID-19 disease in the general population with a 95% frequentist bootstrap confidence interval (CI) of [0.64%, 0.98%]. This would correspond to a number of 2941 infected people in Iceland (with a 95% CI of [2314,3567]). When extrapolating from the deCODE genetics sample to the whole population, these numbers would imply that, as of March 31, 2020, the official statistic of 1136 infections reported by the NUHI is underestimating the number of cases by 1805 (95% CI: [1178,2431]), with only 38.63% of the cases being recorded officially (95% CI: [31.85%, 49.09%]). This indicates that the number of infections is 2.59 times larger than officially reported (95% CI: [2.04, 3.14]).
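The extrapolation above can be retraced with a short sketch. It assumes a percentile bootstrap, since the text does not state which bootstrap variant was used, and the function name `bootstrap_prevalence_ci` is purely illustrative. All counts are the ones quoted in this section.

```python
import numpy as np

N_TESTED = 10401       # deCODE tests, March 15 to March 31, 2020
N_POSITIVE = 84        # infections found in the deCODE sample
POP_ICELAND = 364134   # population of Iceland (January 1, 2020)
REPORTED_NUHI = 1136   # officially reported infections as of March 31, 2020


def bootstrap_prevalence_ci(n_pos, n_tested, n_boot=10000, level=0.95, seed=1):
    """Percentile bootstrap interval for a proportion (assumed variant).

    Resampling n_tested binary outcomes with replacement is equivalent to
    drawing the number of positives from Binomial(n_tested, n_pos / n_tested).
    """
    rng = np.random.default_rng(seed)
    p_hat = n_pos / n_tested
    boot = rng.binomial(n_tested, p_hat, size=n_boot) / n_tested
    tail = (1 - level) / 2
    lower, upper = np.quantile(boot, [tail, 1 - tail])
    return p_hat, lower, upper


p_hat, lower, upper = bootstrap_prevalence_ci(N_POSITIVE, N_TESTED)
infected = p_hat * POP_ICELAND      # extrapolation to the whole population
factor = infected / REPORTED_NUHI   # ratio of estimated to reported infections

print(f"prevalence: {p_hat:.2%} (95% CI: [{lower:.2%}, {upper:.2%}])")
print(f"estimated infections: {infected:.0f}, ratio to reported: {factor:.2f}")
```

The interval endpoints depend on the random seed and on the bootstrap variant, so small deviations from the reported bounds are to be expected.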

Comparison between Austria and Iceland
Iceland has a population of 364134 (reported by https://statice.is/ on January 1, 2020), while Austria's population of 8902600 (according to Statistik Austria) is 24.45 times larger. In terms of health care systems, the two countries are in the top 15 health care systems in the world based on a standardized set of metrics on health system performance, with Austria being a few places ahead of Iceland (around place 9). 1 Table 1 shows the number of deaths, the number of hospitalized patients, the number of hospitalized patients in intensive care, the number of confirmed infections and the number of tests performed in each country, both in absolute value and per million of population (PMP). Note that the reported statistics for Iceland include only the results of the NUHI testing for comparison purposes, as both the Austrian and the NUHI testing focus mostly on symptomatic cases.
As of April 01, 2020, 4 people diagnosed with COVID-19 have died in Iceland, while 146 fatalities have been recorded in Austria. Among the active cases, 35 are currently hospitalized in Iceland and 1071 in Austria, and 11 are reported to need intensive care in Iceland, compared to 215 in Austria. The per million of population figures can be compared between the two countries, with Iceland having a relatively higher number of intensive care patients, more confirmed infections and roughly 4 times more tests than Austria. In Austria 18.76% of the tests are positive, while in Iceland we observe a percentage of positive tests of 12.46%. This might be due to the relatively higher number of conducted tests in Iceland. Time series of the tests and the confirmed infections for the two countries are shown in the Appendix, Figure 7 and Figure 8. Figure 1 shows the age distribution of the population in the two countries. While Iceland has relatively more people in the younger age groups, the percentage of people above 45 is relatively higher in Austria, with the largest difference being in the older adult population above 65.
When comparing the age distribution of the infected people presented in Figure 2, we find similar prevalence rates among most age groups. Only for the age groups from 35-44 and 65+ clear differences are observed. Within the 35-44 interval the proportion of infected people is higher in Iceland, while in Austria relatively more people aged 65 and above are infected among the tested individuals. Partly, this can be explained by Austria having a larger proportion of older adult population but could also highlight the different testing strategies of the two countries, where in Austria detection in non-risk groups might be less likely. Note that the number of infections in Iceland is given by both the deCODE genetics and the NUHI samples.

Estimation of the dark figure in Austria
When trying to estimate the dark figure of SARS-CoV-2 infections in Austria, we consider five different scenarios based on available figures from the ministries of health of the two countries.
In each of the scenarios we estimate the prevalence of COVID-19 in Austria by multiplying the estimated prevalence of the deCODE genetics study by a procedure-specific multiplier k.
The scenarios focus on the number of hospitalized cases, the number of cases in intensive care, the observed deaths due to COVID-19 and the ratio of positive SARS-CoV-2 tests.

Scenario I: Ratio of hospitalized per million inhabitants
In the first scenario the multiplier is calculated by the number of hospitalized people in Austria divided by number of hospitalized people in Iceland (per million of population  recorded as of April 01, 2020. This would indicate that the number of infections is 5.48 times larger than officially reported.

Scenario III: Ratio of deaths per million inhabitants
In this scenario the multiplier is given by the ratio of deaths in Austria and deaths in Iceland. In both countries a deceased person who has previously tested positive for the COVID-19 disease is counted as a death in the official statistics. 2 As of April 01, 2020, 4 (10.98 per million) persons have died in Iceland, while 146 (16.4 per million) deaths have been observed in Austria. This gives a multiplier of 1.49 for this scenario.
The estimated number of infections in Austria is 107339 in this scenario. Thus, as of April 01, 2020, 96857 people were infected but not recorded in the official statistics and only 9.77% of the estimated infections were recorded. This would indicate that the number of infections is 10.24 times larger than officially reported.
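The arithmetic of this scenario can be sketched as follows. The number of confirmed Austrian infections is not quoted directly here, so the sketch recovers it as the difference between the estimated and the unrecorded infections (107339 - 96857); this derivation and the helper names `per_million` and `scenario_estimate` are assumptions made for illustration.

```python
POP_ISL = 364134        # population of Iceland
POP_AUT = 8902600       # population of Austria
PREV_ISL = 84 / 10401   # deCODE prevalence estimate (about 0.81%)

DEATHS_ISL, DEATHS_AUT = 4, 146   # deaths as of April 01, 2020

# Assumption: confirmed infections in Austria, recovered from the text as
# estimated infections minus unrecorded infections.
CONFIRMED_AUT = 107339 - 96857


def per_million(count, population):
    """Convert an absolute count to a count per million inhabitants."""
    return count / population * 1e6


def scenario_estimate(k):
    """Estimated infections in Austria and their ratio to confirmed cases."""
    infections = k * PREV_ISL * POP_AUT
    return infections, infections / CONFIRMED_AUT


# Multiplier: deaths per million in Austria relative to Iceland.
k_deaths = per_million(DEATHS_AUT, POP_AUT) / per_million(DEATHS_ISL, POP_ISL)
infections, ratio = scenario_estimate(k_deaths)
share_recorded = CONFIRMED_AUT / infections

print(f"k = {k_deaths:.2f}, estimated infections = {infections:.0f}")
print(f"ratio = {ratio:.2f}, share recorded = {share_recorded:.2%}")
```

Calling `scenario_estimate(1.0)` corresponds to assuming the same prevalence in Austria as in Iceland.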

Scenario IV: Similar prevalence in the population as in Iceland
We assume the percentage of infected population in Austria is equal to the one in Iceland and use the prevalence of 0.81% estimated from deCODE genetics data to draw conclusions about the dark figure in Austria. This assumption implies that the disease had a similar development in the two countries.
When assuming a similar prevalence rate as in Iceland, we estimate a number of 71899 infected people as of April 01, 2020. This means that 61417 people were infected but not recorded and only 14.58% of the estimated infections were recorded. This indicates that the number of infections is 6.86 times larger than officially reported.

Scenario V: Ratio of the percentage positive SARS-CoV-2 tests
Throughout this scenario, we analyse and compare the testing behavior of both countries. A closer look at the testing results shows that Iceland has conducted roughly 3.99 times more tests per million people compared to Austria, but confirmed only 2.65 times as many infections per million people compared to Austria. Under the (strong) assumption that the probability of obtaining a positive test result does not decrease with increasing testing size, we take the ratio of the percentage of positive tests in Austria and the percentage of positive tests in Iceland of 1.53 as a multiplier of the prevalence rate.
As of April 01, 2020, we estimate a number of 109926 infected people for this scenario. This means that 99444 people were infected but not recorded and only 9.54% of the estimated infections were recorded. This would indicate that the number of infections is 10.49 times larger than officially reported.

Assessing the uncertainty in the estimation of the dark figure in a simulation study
There are numerous sources of uncertainty in estimating the dark figure, some of which can be appropriately accounted for in a statistical modeling framework: i) the plausibility (likelihood) of the different scenarios, ii) the uncertainty related to the true value of the multiplier k, iii) uncertainty in the prevalence of the disease in Iceland.
In this section we address these points in the following way. Regarding i), the assignment of probabilities to each of the scenarios is a challenging task due to disagreement of experts and the lack of data availability. Therefore we resort to simulation and in a first experiment, simulate weights for the different scenarios from a prior distribution and assume the prevalence of the disease in Iceland to be fixed. This allows us to obtain a distribution of the expected value for the number of infections (and DM) over all scenarios. In a second experiment, we propose a Bayesian statistical model for the prevalence rate in Iceland. Moreover, we shift focus from the different scenarios to the multiplier k and propose a prior distribution for this quantity.
Due to the limited testing availability in Austria, the number of recorded infections is an unrealistic indicator when comparing the prevalence rates of Iceland and Austria. Testing data in Austria cannot be assumed to represent a picture of the whole population; rather, it displays the infections in a subgroup of the population. Currently, the most comparable, trustworthy and informative indicators are figures observed in the hospitals, such as the number of hospitalized people, the number of people in intensive care and the observed number of deaths. Other meaningful scenarios are: assuming a similar prevalence rate as in Iceland, since the first recorded infections took place in both countries at a similar point in time; and using the ratio of the percentage of positive tests as a reasonable estimate (or guess) of a multiplier of the prevalence rate. Therefore, we focus on these five scenarios for our statistical modeling purposes.

Simulating the probability of the five scenarios
We choose a Dirichlet prior on the probability of the five scenarios:

$$(p_1, \dots, p_5) \sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_5),$$

where $\alpha_1 = \alpha_2 = \dots = \alpha_5 \equiv \alpha$ is a hyper-parameter to be chosen. Setting all the parameters equal implies that in the long run we believe the scenarios to be equally likely. In the simulation we quantify the variability of the DM with respect to the variability in the probability of the scenarios. In this experiment we consider the prevalence in Iceland to be fixed at 0.81%.
We choose two values for the hyper-parameter of the Dirichlet distribution: $\alpha = 1$, which corresponds to assigning evenly distributed probabilities to the scenarios, and $\alpha = 0.1$, which corresponds to assigning most mass to one of the scenarios. For a visualization of the implied marginal prior distributions of $p_i$, $i = 1, \dots, 5$, see Figure 3.
In our simulation, we draw 10000 samples from the Dirichlet distribution and weigh the estimated number of infections in each scenario by the sampled weights in order to obtain the expected value of the DM. For both values of the hyper-parameter we obtain the same mean of 8.34 for the DM. Figure 4 shows the distribution of the DM under variability of the scenarios. More mass is assigned to the tails of the distribution in the case $\alpha = 0.1$. This is due to the fact that for this hyper-parameter the more extreme scenarios get more weight, because the probability of one of the scenarios is close to one and the others are close to zero. The modes correspond to the estimated multipliers calculated in the scenarios.
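A minimal sketch of this experiment is given below. The scenario multipliers are recomputed here from the hospital, intensive care and death counts quoted in the comparison section, together with the multiplier of 1.53 for the positive test ratio, rather than taken from Table 2. They should therefore be read as approximations, and the number of confirmed Austrian infections is again derived as 107339 - 96857. The function name `simulate_weighted_dm` is illustrative.

```python
import numpy as np

POP_ISL, POP_AUT = 364134, 8902600
PREV_ISL = 84 / 10401             # prevalence held fixed in this experiment
CONFIRMED_AUT = 107339 - 96857    # assumption: derived from quoted figures


def ratio_per_million(count_aut, count_isl):
    """Austrian count per million divided by Icelandic count per million."""
    return (count_aut / POP_AUT) / (count_isl / POP_ISL)


# Assumed multipliers k for the five scenarios (recomputed, not from Table 2).
multipliers = np.array([
    ratio_per_million(1071, 35),  # hospitalized
    ratio_per_million(215, 11),   # intensive care
    ratio_per_million(146, 4),    # deaths
    1.0,                          # same prevalence as in Iceland
    1.53,                         # ratio of positive test percentages
])

# Ratio of estimated to confirmed infections (the DM) in each scenario.
scenario_dm = multipliers * PREV_ISL * POP_AUT / CONFIRMED_AUT


def simulate_weighted_dm(alpha, n_draws=10000, seed=1):
    """Weight the scenario DMs by draws from a symmetric Dirichlet prior."""
    rng = np.random.default_rng(seed)
    weights = rng.dirichlet(np.full(len(scenario_dm), alpha), size=n_draws)
    return weights @ scenario_dm


for alpha in (1.0, 0.1):
    dm = simulate_weighted_dm(alpha)
    low, high = np.quantile(dm, [0.025, 0.975])
    print(f"alpha={alpha}: mean={dm.mean():.2f}, 95% interval=[{low:.2f}, {high:.2f}]")
```

Because the Dirichlet prior is symmetric, the expected weighted DM equals the plain average of the scenario values for any $\alpha$; only the spread changes.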

Statistical modeling of the prevalence rate
The prevalence of the disease in Iceland's general population has been considered fixed in the previous experiment. In order to account for the uncertainty surrounding this quantity, we make use of the beta-binomial model. In this Bayesian model, a Beta prior is assumed on the prevalence rate in Iceland $p_{\mathrm{ISL}}$:

$$p_{\mathrm{ISL}} \sim \mathrm{Beta}(a, b),$$

where a and b are hyper-parameters to be chosen. The prior distribution on the prevalence of the disease should assign high mass to small values of $p_{\mathrm{ISL}}$. Suitable hyper-parameters could be, e.g., a = 1 and b = 50, which implies a prior mean and a prior standard deviation of roughly 0.02 each.
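As a check on the stated prior moments, the standard formulas for the mean and standard deviation of a Beta(a, b) distribution, evaluated at a = 1 and b = 50, give:

```latex
\[
\mathbb{E}[p_{\mathrm{ISL}}] = \frac{a}{a+b} = \frac{1}{51} \approx 0.0196,
\qquad
\mathrm{sd}(p_{\mathrm{ISL}}) = \sqrt{\frac{ab}{(a+b)^2 (a+b+1)}}
  = \sqrt{\frac{50}{51^2 \cdot 52}} \approx 0.0192 .
\]
```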
The number of infected people in the deCODE genetics sample, $N^{\mathrm{ISL}}_{\mathrm{inf}}$, is then modeled as a binomial random variable with size equal to the number of individuals tested by deCODE genetics, $N^{\mathrm{ISL}}_{\mathrm{test}}$, and probability $p_{\mathrm{ISL}}$:

$$N^{\mathrm{ISL}}_{\mathrm{inf}} \mid p_{\mathrm{ISL}} \sim \mathrm{Bin}(N^{\mathrm{ISL}}_{\mathrm{test}}, p_{\mathrm{ISL}}).$$

We use the deCODE genetics results in order to estimate the true number of infected in Austria. We assume that $N^{\mathrm{AUT}}_{\mathrm{inf}}$, the true number of infected individuals in Austria, is a binomial random variable with size equal to the number of inhabitants of Austria, $N^{\mathrm{AUT}}_{\mathrm{pop}}$, and probability which is a multiple of the probability in Iceland, $k \cdot p_{\mathrm{ISL}}$:

$$N^{\mathrm{AUT}}_{\mathrm{inf}} \mid k, p_{\mathrm{ISL}} \sim \mathrm{Bin}(N^{\mathrm{AUT}}_{\mathrm{pop}}, k \cdot p_{\mathrm{ISL}}), \tag{1}$$

where k is a multiplier. For k fixed, this would represent the posterior beta-binomial predictive distribution, where the posterior distribution of $p_{\mathrm{ISL}}$ would be given by:

$$p_{\mathrm{ISL}} \mid N^{\mathrm{ISL}}_{\mathrm{inf}} \sim \mathrm{Beta}(a + N^{\mathrm{ISL}}_{\mathrm{inf}},\; b + N^{\mathrm{ISL}}_{\mathrm{test}} - N^{\mathrm{ISL}}_{\mathrm{inf}}). \tag{2}$$

However, there is also uncertainty coming from the value of k. We consider here two prior probability distributions for the multiplier k.
Prior 1: Discrete probability. We consider each of the five values of k presented in Table 2 to be equally likely.
Prior 2: Mixture of gamma distributions. It is unrealistic to assume that k only has 5 possible values. A more realistic assumption is that k is a continuous variable coming from a mixture of five distributions which have their means at the values computed in the different scenarios. Moreover, the variance of the components should differ among the scenarios, due to the small number of observations available in some of them. For example, the value of k for the third scenario is computed by using data on only 4 deaths in Iceland, so this component can be expected to have higher variance, as one additional death can change the multiplier significantly. Given that the multipliers are positive values, we assume a gamma distribution for the components:

$$k \sim \sum_{s=1}^{S} \phi_s \, \mathrm{Gamma}(\alpha_s, \beta_s),$$

where S denotes the number of scenarios, $\phi_s$ denotes the probability of scenario s, $\alpha_s > 0$ is the shape and $\beta_s > 0$ is the rate parameter of the gamma distribution for scenario s. Table 3 presents the chosen hyper-parameters for each of the gamma distributions corresponding to the scenarios. We choose the hyper-parameters $\alpha_s$ and $\beta_s$ such that the mean of the distribution equals the value of the multiplier estimated in each of the scenarios and the variance equals predetermined values.
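Matching a gamma distribution to a given mean m and variance v uses the identities mean = shape / rate and variance = shape / rate^2, so that shape = m^2 / v and rate = m / v. The sketch below illustrates this moment matching and the sampling from the resulting mixture. The component variances are hypothetical placeholders (the values of Table 3 are not reproduced here), the component means are rounded approximations of the scenario multipliers, and the function names are illustrative.

```python
import numpy as np


def gamma_shape_rate(mean, variance):
    """Shape and rate of a gamma distribution with the given mean and variance."""
    mean = np.asarray(mean, dtype=float)
    variance = np.asarray(variance, dtype=float)
    return mean**2 / variance, mean / variance


def sample_gamma_mixture(means, variances, probs, n_draws=10000, seed=1):
    """Draw from a mixture of gamma components specified by their moments."""
    rng = np.random.default_rng(seed)
    shape, rate = gamma_shape_rate(means, variances)
    # Pick a component for each draw, then sample from that component.
    component = rng.choice(len(shape), size=n_draws, p=probs)
    # NumPy parameterizes the gamma by shape and scale, where scale = 1 / rate.
    return rng.gamma(shape[component], 1.0 / rate[component])


# Approximate scenario multipliers used as component means (assumption).
means = [1.25, 0.80, 1.49, 1.00, 1.53]
# Hypothetical variances; the deaths scenario is given the largest one.
variances = [0.05, 0.05, 0.15, 0.05, 0.05]
probs = [0.2] * 5   # equal scenario probabilities

k_draws = sample_gamma_mixture(means, variances, probs)
print(f"mean of k: {k_draws.mean():.2f}, sd of k: {k_draws.std():.2f}")
```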
The mixture of gamma prior for the multiplier k is displayed in Figure 5.

Sampling. After simulating 10000 values for k from the priors introduced above, we sample $p_{\mathrm{ISL}}$ from the distribution in Equation 2. Finally, $N^{\mathrm{AUT}}_{\mathrm{inf}}$ is simulated from the binomial distribution given in Equation 1, where we replace k and $p_{\mathrm{ISL}}$ with the sampled values. Figure 6 provides the predictive distributions for the DM with the two prior specifications. The average DM lies at 8.39 for prior 1 and at 8.38 for prior 2.

Table 4 displays a summary of the ratio of estimated infections and confirmed infections with corresponding intervals for quantifying the uncertainty for all five modeling approaches. We find that the mean estimates of the DM are between 8.3 and 8.4 for all five approaches. In the frequentist approach, where we average over the equally weighted scenarios, we find a mean estimate of 8.33 with a 95% frequentist bootstrap confidence interval of [6.56, 10.11]. When simulating the probabilities of the five scenarios with a Dirichlet distribution with $\alpha = 1$, we find results similar to those of the frequentist approach, while a hyper-parameter of $\alpha = 0.1$ widens the uncertainty bounds, as more mass is assigned to one of the scenarios in each of the 10000 simulations. When accounting for both the uncertainty in the prevalence in Iceland and the value of the multiplier in a beta-binomial model, the mean estimates are 8.39 for prior 1 and 8.38 for the mixture of gamma prior on the multiplier k. As expected, the range of the credible intervals increases as more uncertainty is being accounted for. The uncertainty bounds are widest for the mixture of gamma prior, where a plausible continuous range for the multiplier k is taken into account. This final approach also incorporates uncertainty regarding the calculation of the multipliers in the scenarios.
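The sampling scheme (k from its prior, then the Icelandic prevalence from the Beta posterior, then the Austrian count from the binomial) can be sketched for the discrete prior as follows. The five values of k are recomputed from the counts quoted in the comparison section and from the multiplier of 1.53 for the positive test ratio, not copied from Table 2. The hyper-parameters a = 1 and b = 50 are the example values mentioned in the text, and the confirmed Austrian infections are derived as 107339 - 96857. All of these are assumptions of the sketch.

```python
import numpy as np

POP_ISL, POP_AUT = 364134, 8902600
N_TEST_ISL, N_INF_ISL = 10401, 84   # deCODE tests and positives
CONFIRMED_AUT = 107339 - 96857      # assumption: derived from quoted figures
A, B = 1.0, 50.0                    # example Beta hyper-parameters


def ratio_per_million(count_aut, count_isl):
    """Austrian count per million divided by Icelandic count per million."""
    return (count_aut / POP_AUT) / (count_isl / POP_ISL)


# Assumed support of the discrete prior on k (recomputed, not from Table 2).
K_VALUES = np.array([
    ratio_per_million(1071, 35),  # hospitalized
    ratio_per_million(215, 11),   # intensive care
    ratio_per_million(146, 4),    # deaths
    1.0,                          # same prevalence as in Iceland
    1.53,                         # ratio of positive test percentages
])


def simulate_dm(n_draws=10000, seed=1):
    """Predictive draws of estimated over confirmed infections in Austria."""
    rng = np.random.default_rng(seed)
    k = rng.choice(K_VALUES, size=n_draws)   # discrete prior, equal weights
    # Beta posterior of the Icelandic prevalence given the deCODE sample.
    p_isl = rng.beta(A + N_INF_ISL, B + N_TEST_ISL - N_INF_ISL, size=n_draws)
    # Binomial draw of the number of infected individuals in Austria.
    n_inf_aut = rng.binomial(POP_AUT, k * p_isl)
    return n_inf_aut / CONFIRMED_AUT


dm = simulate_dm()
low, high = np.quantile(dm, [0.025, 0.975])
print(f"mean DM: {dm.mean():.2f}, 95% interval: [{low:.2f}, {high:.2f}]")
```

Replacing the `rng.choice` line with draws from a gamma mixture would give the analogous simulation for the second prior.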

Discussion
The figures used in computing the multipliers between Austria and Iceland in the different  At the time of the revision of the paper, the aforementioned field study conducted by the research institute SORA has been completed on 1544 Austrian individuals (this corresponds to 173 tests per million inhabitants). Among these, 0.33% active infections were confirmed (Ogris and Hofinger April 10, 2020). This figure provides a snapshot for active SARS-CoV-2 infections for the period April 01, 2020 to April 06, 2020. Note, however, that the SORA study is not directly comparable with the deCODE genetics study for several reasons. First, the sample size is much larger in the deCODE genetics study, with 16886 tests (46373 tests per million inhabitants) performed by April 06, 2020. Second, the sample selection is voluntary in the deCODE study but more complex in the SORA study (Ogris and Hofinger April 10, 2020). Third, the studies are conducted over different time-periods, with the deCODE genetics study being still ongoing since March 15, 2020. This is important because infected who recovered before the beginning of each testing period are not accounted for. With the population being widely compliant with the social-distancing measures imposed by the Austrian government on March 13, it is plausible that the number of newly infected started to decrease immediately.
Hence, we suspect a higher prevalence rate in Austria one or two weeks before April 06. 3

Conclusion
Our findings suggest that the number of undetected infections of SARS-CoV-2 in Austria, both active and recovered, is a multiple of the reported figures, with an average ratio of estimated infections and confirmed infections of around 8.36. However, the uncertainty surrounding this estimate is significant. We employ several frequentist and Bayesian methods to appropriately account for the uncertainty and find that plausible estimates may well lie in the interval [3.96, 12.61]. The analysis relies on the deCODE genetics study in Iceland, which offers large scale testing among the general population. We investigate the magnitude and uncertainty of the dark figure of the SARS-CoV-2 infections in Austria by formulating several scenarios relying on data on the number of COVID-19 cases which have been hospitalized, in intensive care, as well as on the number of deaths and positive tests in Iceland and Austria.
We provide simulation experiments in a statistical framework for accounting for different sources of uncertainty such as the likelihood of the different scenarios, the prevalence of COVID-19 in Iceland and the uncertainty regarding the multiplier between the prevalence in Austria and the prevalence in Iceland.
One of the primary limitations of the proposed framework is the assumption that the deCODE genetics study in Iceland constitutes a representative sample of the entire population. It is to be noted that the application for testing occurs on a voluntary basis through an online form, which can be a source of self-selection bias leading to the underrepresentation of some age groups or to an over-proportional percentage of infections in the sample. A second limitation is the reliance of the study on key figures reported by the two governments, where a lag in reporting is likely. Moreover, while the data is mostly comparable, there might still be differences in the definitions of the reported figures. Third, we assume a static setting for modeling the prevalence rate in both countries, which does not control for the time span over which the deCODE genetics study has recorded the observations. Last, other factors such as the handling of the COVID-19 disease might to some extent vary between the countries. While some of the performed experiments are able to account at least partially for these issues, we stress that our results remain dependent on the assumptions made throughout the analysis.