Family Generation Process from Administrative Data Sources and the Austrian Register-Based Census 2011

For the Austrian register-based census, techniques to generate family statistics from administrative data sources were developed. The approach is based on data on relationships and how to handle them: in general, in a fixed household, and in an imputation process. To this end, we combined algebraic, graph-theoretical and statistical tools to construct a general framework.


Introduction
Household and family statistics as part of the population census have been generated in Austria since 1900. These statistics include information about types of households and families, as well as the specific status of a household and family member, respectively (e.g. husband, wife, etc.).
However, the register-based census act of 16th March 2006 (cf. Registerzählungsgesetz 2006, §7 (1)-(4)) stipulated that the population census should be conducted completely by using administrative data sources. Such register-based statistics have a long tradition in the Nordic countries (see United Nations 2007) and hold several advantages in comparison to classical surveys. For example, such a procedure is very cost-efficient and there is no longer any respondent burden. On the other hand, this kind of census is a challenge in statistical methodology, i.e. data editing, coding variables, matching and deriving attributes, and finally estimating missing values or objects. For a detailed description of the Austrian Census see Lenk (2009).
A general description of the register system and the methodological work can be found in Wallgren and Wallgren (2007).
Obviously, creating household and family statistics from administrative registers alone leads to several challenges. In a traditional census each member of a household has to fill out a questionnaire, which includes questions about the household status. Using that information it is, more or less, easy to deduce the type of household. For example, if there is no relationship in the household, it is definitely a non-family household. In a register-based census, on the other hand, the lack of relationships is no longer a sufficient criterion. Furthermore, a household with at least three persons can become implausible because of incorrect relationships (e.g. two partnership relations in a three-person household). In summary, the challenge in household and family statistics is to detect implausible households and to estimate missing relations.
The purpose of this paper is to describe a framework for family generation via relationships developed by Statistics Austria. As preparation, we explain in Section 2 some basic definitions on households and families. After that we describe the available data, in particular the data about relationships. In Section 3, we look at the household level and show how plausibility is checked. Additionally, there is a description of an imputation process for relations. In Section 4, we finish with a short discussion on the quality assessment for the considered statistics and some closing remarks.

Family nucleus
Since we focus on a register-based census, a household is defined by the household-dwelling concept (see United Nations 2006), i.e. we consider all persons living in a housing unit to be members of the same household. We are interested in family statistics, so we limit our analyses to private households; institutional households are not included.
Child refers to a blood, step- or adopted son or daughter (regardless of age or marital status) who has usual residence in the household of at least one of the parents, and who has no partner or own child(ren) in the same household. Foster children are not included.
A family nucleus is defined as two or more persons who live in the same household and whose relationship is defined as either married or cohabiting partners, or as a registered same-sex couple, or as parent and child.
For these definitions and further information on households and families we refer to United Nations (2006).

Data sources
To derive a family nucleus, information on households, demography and relationships is needed.
Households in the Austrian census are generated by linking the Central Population Register (CPR) with the buildings and dwellings register (BDR). These registers contain the same addresses (numerical codes) for buildings, but not always the same information on door numbers and therefore dwellings. The BDR is highly reliable at building level. As far as dwellings are concerned, the linking of dwellings with people registered in the CPR is less successful due to some missing or wrong door numbers. In these remaining cases (about 1.9% of the Austrian population) additional sources, e.g. relationships, are used to generate households.
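As a rough illustration of this linkage step, the following sketch groups persons from the CPR into households by matching building code and door number against the dwellings of the BDR. The record layout, the function name `build_households` and the sample values are hypothetical and are not taken from the actual registers.

```python
from collections import defaultdict

def build_households(cpr_persons, bdr_dwellings):
    """Group persons by dwelling; return (households, unmatched persons).

    cpr_persons:   list of (person_id, building_code, door_number)
    bdr_dwellings: set of (building_code, door_number)
    """
    households = defaultdict(list)
    unmatched = []
    for person_id, building, door in cpr_persons:
        key = (building, door)
        if key in bdr_dwellings:
            households[key].append(person_id)
        else:
            # Missing or wrong door number: additional sources are needed.
            unmatched.append(person_id)
    return dict(households), unmatched

# Hypothetical sample data: p3 has no door number and cannot be linked.
persons = [("p1", 1001, "1"), ("p2", 1001, "1"), ("p3", 1001, None)]
dwellings = {(1001, "1"), (1001, "2")}
households, unmatched = build_households(persons, dwellings)
```

Persons left in `unmatched` correspond to the remaining cases for which other information, such as relationships, has to be used.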
The demographic information we need is sex, age and marital status. Further, a variable age at registration is needed, which can be derived from the date of registration in the CPR and the date of birth.
The basic data sources for relationships are:
• Central social security register (CSSR)
• Child allowance register (CAR)
• Tax register (TR)
The variable relationship occurs in more than one register (it is a so-called multiple attribute).
In the CSSR, people who are co-insured through a family member's national health insurance are included.
There is no extra notation for the opposite of the undirected relations Cou, Sib and 0. The set of relations and their opposite relations is denoted by $\mathcal{R}$.
Remark. In the data sources CSSR and CAR there is a further type of relationship, the foster parent-foster child relation. This information can be used to ensure that a person cannot be the child of such a household member (since they are in a foster parent-foster child relation).
Obviously, a valid relation requires two different persons, and between those the relation has to be well-defined, i.e. exactly one type of relationship can be valid. Hence, one has to define rules for plausibility for each type of relationship. This process of data preparation is briefly illustrated here. Depending on the type of relation, it must satisfy certain requirements on sex, age and marital status, respectively. As an example, Statistics Austria uses the following rules (Table 1) to check a Cou relation $p_1 \to p_2$ between two persons $p_1, p_2$ with age $a_1, a_2$ ($a_1 \geq a_2$), sex $s_1, s_2$ and marital status $m_1, m_2$, respectively. If the relationship does not comply with those rules, it will be deleted. At present, if a relation in a source is not consistent with other sources, it will be deleted. It is also conceivable to develop certain rules to keep one of these relations.
The final step to prepare relationships is to derive new ones with the help of existing ones.
Here it is crucial that the relationships are archived (Statistics Austria has been collecting data on relationships since 2006). Let $p_1, p_2, p_3$ be pairwise distinct persons and assume that there are relations $p_1 \to p_2$, $p_2 \to p_3$ and assume further that there exists no relation $p_1 \to p_3$. This relation $p_1 \to p_3$ is then derived by composing $p_1 \to p_2$ and $p_2 \to p_3$ (see Table 2).
Table 2: Rules to derive new relations.
This can be important if the relations $p_1 \to p_2$, $p_2 \to p_3$ no longer exist in a household of the current population census because the involved person $p_2$ is absent, but $p_1, p_3$ are still present in the same household and there exists no direct relation $p_1 \to p_3$. This derivation goes beyond the census population level.
That way, Statistics Austria obtains over 9 million relations for the register-based census 2011.

Households and graphs
Let $n \in \mathbb{N}$. An (abstract) household $H = (P, R)$ is a non-empty finite set $P = \{p_1, p_2, \ldots, p_n\}$ of persons together with a set of relations $R = \{(p_i \to p_j) \in \mathcal{R} \mid 1 \leq i < j \leq n\}$, where $\mathcal{R}$ denotes the set of all relations and their opposite relations. Hence, a household has at most $\binom{n}{2}$ relations unequal to 0. To a household $H$ we assign a simple graph $G_H$ by taking $P$ as the set of vertices and $\{r \in R \mid r \neq 0\}$ as the set of edges. A relation $p_i \to p_j$ induces a direction and a label (the type of relationship) on the corresponding edge. Since the labels Cou and Sib do not change if we reverse the direction, we skip their direction in $G_H$. Hence $G_H$ is a simple, (partially) directed, labeled graph.
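To make the graph view concrete, here is a minimal sketch, assuming a household is stored as a list of persons plus a dictionary of labeled pairs. The function names and the sample household are hypothetical. Note that the sketch only computes ordinary connected components (ignoring edge directions), not the strong-connected components used later in the text.

```python
def household_graph(relations):
    """Edges of G_H: all stored pairs whose relation is not '0'."""
    return {pair: label for pair, label in relations.items() if label != "0"}

def max_relations(n):
    """Upper bound on the number of non-zero relations: n choose 2."""
    return n * (n - 1) // 2

def connected_components(persons, edges):
    """Connected components of G_H, ignoring edge directions."""
    neighbours = {p: set() for p in persons}
    for p, q in edges:
        neighbours[p].add(q)
        neighbours[q].add(p)
    seen, components = set(), []
    for start in persons:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            p = stack.pop()
            if p in component:
                continue
            component.add(p)
            stack.extend(neighbours[p] - component)
        seen |= component
        components.append(component)
    return components

# Hypothetical household: a couple, a child of p1, and an unrelated person.
persons = ["p1", "p2", "p3", "p4"]
relations = {("p1", "p2"): "Cou", ("p1", "p3"): "P-C", ("p2", "p4"): "0"}
edges = household_graph(relations)
components = connected_components(persons, edges)
```

In this example the 0 relation between p2 and p4 contributes no edge, so p4 forms a component of its own.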
For the sake of completeness we define the household relationship matrix $A = (a_{ij})$ by

Algebraic structure of relations
We wish to define an operation $\circ$ on $\mathcal{R}$ by the natural way of composition. Unfortunately, composition is not a well-defined operation on $\mathcal{R}$, as the following example shows.
Example. Let $p_1, p_2, p_3$ be pairwise distinct persons in the same household and let $p_1 \to p_2 = \text{C-P}$, $p_2 \to p_3 = \text{Gp-Gc}$. Then there are two possibilities for the composition $\text{C-P} \circ \text{Gp-Gc}$. It seems natural that $p_1 \to p_3$ is a P-C relation. Then $p_1, p_2, p_3$ are related in a direct line, as is shown in Figure 1a. But it is also possible that $p_1 \to p_3$ is a 0 relation. This makes sense if we assume that there is an unknown person $p_4$ who is in a P-C relation to $p_3$ and $p_2$ is in a P-C relation to $p_4$. Then $p_1$ is the uncle or aunt of $p_3$, i.e. a 0 relation (see Figure 1b). So in general, the operation $\circ$ is not well-defined, which is caused by the fact that we involve relations between three generations.
However, there are at most two admissible values for a composition of two relationships. The first is always 0 and the second one depends on the composed relationships. Therefore, when defining the table of relationships operations (TRO), we label all of them by a. If there is one and only one admissible value, the composition will be unlabeled.
a There are two admissible values for the composition. The first is 0 and the second one is shown in the table.
The first column in TRO represents $p_1 \to p_2$ and the first row represents $p_2 \to p_3$. Additionally, ∅ means that there is no admissible value.
The Cou relation labeled by b in TRO requires special rules: We wish to calculate $p_1 \to p_3 = \text{P-C} \circ \text{C-P}$, which should be Cou. But in this case, relations alone are not able to guarantee the truth. More precisely, we have to check sex, age and marital status, respectively, as in Table 1. If these conditions are not fulfilled, then the whole household is called implausible (see Section 3.3).
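The idea of a composition with more than one admissible value can be sketched as a lookup table that maps a pair of relations to a set of values. The fragment below is illustrative only: it contains just the two compositions discussed in the text, and the names `TRO`, `NEEDS_DEMOGRAPHIC_CHECK`, `compose` and `is_well_defined` are hypothetical.

```python
# Partial, illustrative table: (p1 -> p2, p2 -> p3) -> admissible values of p1 -> p3.
# Only the two compositions discussed in the text are listed here.
TRO = {
    ("C-P", "Gp-Gc"): {"0", "P-C"},   # label a: two admissible values
    ("P-C", "C-P"): {"Cou"},          # label b: needs a demographic check
}

# Compositions whose result must also pass the checks on sex, age and marital status.
NEEDS_DEMOGRAPHIC_CHECK = {("P-C", "C-P")}

def compose(r, s):
    """Admissible values of p1 -> p3 given r = p1 -> p2 and s = p2 -> p3."""
    if (r, s) not in TRO:
        raise KeyError(f"composition of {r} and {s} is not covered by this sketch")
    return TRO[(r, s)]

def is_well_defined(r, s):
    """True if the composition has exactly one admissible value."""
    return len(compose(r, s)) == 1
```

Representing the result as a set makes the ambiguity explicit: a composition is only usable for deriving a new relation when the set has a single element.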
The algebraic structure on $\mathcal{R}$ defined by composition in TRO is not associative as one can see by

Plausibility conditions: Then we can define a new operation $\odot\colon \mathcal{R} \times \mathcal{R} \to \mathcal{R}$ in the following way:

Now Statistics Austria uses the following approach:
1. Take pairwise distinct $p_1, p_2, p_3 \in P$. Check the plausibility conditions. If they are satisfied and $t \neq 0$, then do nothing. If they are satisfied and $t = 0$ and $r \odot s \neq 0$, then replace $t$ by $r \odot s$. If they are not satisfied, stop and label $H$ to be implausible. Do that for all pairwise distinct $p_1, p_2, p_3 \in P$.

This approach overwrites the relations $0 \in R$ as long as new relations are derived and checks in addition if $H$ is plausible.
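The iteration over triples, repeated until no new relation is derived, can be sketched as a fixed-point loop. This is only a sketch under several assumptions that the text does not confirm: $r$, $s$, $t$ are taken to be $p_1 \to p_2$, $p_2 \to p_3$, $p_1 \to p_3$; the plausibility condition is simplified to "an existing relation must agree with the derived one"; and the single table entry in `COMPOSE` is a hypothetical example.

```python
from itertools import permutations

# Hypothetical fragment of a composition table with unambiguous results only.
COMPOSE = {("P-C", "P-C"): "Gp-Gc"}

def derive_relations(persons, rel):
    """Fill in '0' relations by composition until nothing new is derived.

    rel maps ordered pairs (p, q) to a label; missing pairs count as '0'.
    Returns (relations, plausible).
    """
    rel = dict(rel)
    changed = True
    while changed:                       # repeat while a new relation appears
        changed = False
        for p1, p2, p3 in permutations(persons, 3):   # all distinct triples
            r = rel.get((p1, p2), "0")
            s = rel.get((p2, p3), "0")
            t = rel.get((p1, p3), "0")
            derived = COMPOSE.get((r, s), "0")
            if derived == "0":
                continue
            if t == "0":
                rel[(p1, p3)] = derived  # overwrite the 0 relation
                changed = True
            elif t != derived:
                return rel, False        # assumed check: contradiction found
    return rel, True

relations, plausible = derive_relations(
    ["a", "b", "c"], {("a", "b"): "P-C", ("b", "c"): "P-C"}
)
```

The loop terminates because every pass either fills at least one 0 relation or changes nothing, and there are only finitely many pairs.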

Definition.
A household $H$ is called plausible if $n \leq 2$ or $H$ is not implausible by the approach. $H$ is called complete if no new relation $\neq 0$ can be derived.
Assume that $H$ is plausible with size $|H| > 1$ and $p \in P$. The household status of $p$ is . . .
. . . partner if and only if there exists $q \in P$ such that $p \to q = \mathrm{Cou}$.
. . . child (not of lone parent) if $p$ is not a partner and there exists a partner $q \in P$ such that $q \to p = \text{P-C}$.
. . . child (of lone parent) if $p$ is not a partner and there exists $q \in P$ who is not a partner, such that $q \to p = \text{P-C}$.
. . . lone parent if $p$ is not a partner and there exists a child $q \in P$ such that $p \to q = \text{P-C}$.
. . . not alone living otherwise.
The household status implies the type of household, and we immediately get the following sufficient criterion (this works well for the more detailed classification).
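The status rules above can be sketched as a small function. This is a minimal sketch, assuming the rules are applied in the listed order and that a relation may be stored in either direction; the names `OPPOSITE`, `relation` and `household_status` as well as the sample households are hypothetical.

```python
OPPOSITE = {"P-C": "C-P", "C-P": "P-C", "Gp-Gc": "Gc-Gp", "Gc-Gp": "Gp-Gc",
            "Cou": "Cou", "Sib": "Sib", "0": "0"}

def relation(rel, p, q):
    """Relation p -> q, using the stored direction or its opposite."""
    if (p, q) in rel:
        return rel[(p, q)]
    if (q, p) in rel:
        return OPPOSITE[rel[(q, p)]]
    return "0"

def household_status(persons, rel):
    """Assign a household status to every person, rules in listed order."""
    def others(p):
        return [q for q in persons if q != p]

    partners = {p for p in persons
                if any(relation(rel, p, q) == "Cou" for q in others(p))}
    status = {p: "partner" for p in partners}
    # Children: non-partners with at least one parent in the household.
    for p in persons:
        if p in partners:
            continue
        parents = [q for q in others(p) if relation(rel, q, p) == "P-C"]
        if any(q in partners for q in parents):
            status[p] = "child (not of lone parent)"
        elif parents:
            status[p] = "child (of lone parent)"
    children = {p for p, s in status.items() if s.startswith("child")}
    # Remaining persons are lone parents or not alone living.
    for p in persons:
        if p in status:
            continue
        if any(relation(rel, p, q) == "P-C" for q in children):
            status[p] = "lone parent"
        else:
            status[p] = "not alone living"
    return status

couple = household_status(
    ["m", "f", "c"], {("m", "f"): "Cou", ("m", "c"): "P-C", ("f", "c"): "P-C"}
)
lone = household_status(["f", "c", "x"], {("c", "f"): "C-P"})
```

The second example stores the relation from the child's side (C-P); the lookup through `OPPOSITE` makes the result independent of the stored direction.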
Criterion. In a strong-connected, plausible household, the type of household is uniquely determined.
Remarks. 1. Note that the criterion above is not a necessary one; e.g. the type of $H = (\{p_1, p_2\}, \{r\})$, $r \neq 0$, is uniquely determined, but $H$ is not strong-connected for $r \in \{\text{Gp-Gc}, \text{Gc-Gp}, \text{Sib}\}$. The condition of being strong-connected is strict for $n > 2$, as we can see in the following. Take the households $H_1$, $H_2$ shown in Figure 2. Then $H_1$, $H_2$ are connected, plausible households of different types. In this case it is possible to get $H_2$ (Figure 2b) from $H_1$ (Figure 2a) by estimating $r_1 = \mathrm{Cou}$, i.e. the type of household depends on the estimation process.
In practice, almost all connected households are strong-connected.
2. The type of household and the household status, respectively, are well-defined, which means that they are invariant under permutation of $P$, or equivalently, they depend only on the isomorphism class of $G_H$.

If $H$ is implausible, one has to redefine at least one $r \in R$, $r \neq 0$, to $r = 0$. It is not easy and sometimes impossible to determine which relations should be redefined to 0 (e.g. two partnership relations in a three-person household). However, there are several ways to generate a plausible household from an implausible one. Some of these are:
− Redefine all relations in $R$ to 0.
− If $G_H$ is not connected, check (using the plausibility conditions) which connected component of $G_H$ is implausible and redefine all relations in that component to 0.
− Check if the household becomes plausible by redefining just one certain relation and if there is no other relation with this property.
Implausible households are rare: in the Austrian register-based census 2011, only about 0.05% of all private households with three or more persons are implausible.

Estimation of Relationships
From now on we assume that $H$ is a complete, plausible, not strong-connected household. As we have seen, in such a household we are no longer able to guarantee the type of household. Hence, we have to impute relations in $H$ such that the estimated household $\hat{H}$ stays plausible.
Our imputation method is a combination of a hot-deck technique based on demographic characteristics, an ordering relation based on normalized frequencies, and some static rules involving date of registration and external household relations. Since we are interested in strong-connection, we only estimate relations of types Cou and P-C. Note that the imputation of P-C types includes the estimation of C-P types via permutation of the persons concerned. Before we handle the general case, let us consider the special case of a two-person household. An overview of the general data workflow of the Austrian census, with special focus on the imputation process, can be found in Kausl (2012).
Let $n = 2$, $H = (\{p_1, p_2\}, \{0\})$, and let $a_1, a_2$ denote the age, $a_\Delta = a_1 - a_2$ the age-difference, $s_1, s_2$ the sex and $m_1, m_2$ the marital status of $p_1, p_2$. The variable parents indicates whether or not a person has at least a mother or a father in a separate household. For the variable ages at registration, the date of the later registration of both persons was determined, then the minimum age of those persons at this date was computed. Relations are highly correlated with the age-difference $a_\Delta$ and with sex. To compute a probability distribution depending on these variables and the relation type, we need a kind of non-relation to the relation type in question. Such non-relations imply the complementary probability.
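The variable ages at registration can be computed directly from its description. The sketch below assumes that age is measured in completed years; the function names and the sample dates are hypothetical.

```python
from datetime import date

def age_on(date_of_birth, reference):
    """Age in completed years on the reference date."""
    years = reference.year - date_of_birth.year
    if (reference.month, reference.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1  # birthday not yet reached in the reference year
    return years

def ages_at_registration(birth_1, registration_1, birth_2, registration_2):
    """Minimum age of both persons on the later of the two registration dates."""
    later = max(registration_1, registration_2)
    return min(age_on(birth_1, later), age_on(birth_2, later))

# Hypothetical pair: the later registration is on 30 June 2010.
value = ages_at_registration(date(1970, 5, 1), date(2001, 3, 1),
                             date(1995, 9, 15), date(2010, 6, 30))
```

On the later registration date the first person is 40 and the second is 14, so the variable takes the value 14.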
Cou-distribution. Let $s \in \{\text{male}, \text{female}\}$ and $a_\Delta \in \mathbb{Z}$ be arbitrary but fixed. To compute the complementary probability we define non-Cou relations as follows: Take a three-person household $H$ with persons $\{p_1, p_2, p_3\}$ such that $p_1 \to p_2 = \mathrm{Cou}$, $p_2 \to p_3 = 0$, the sub-household $\{p_2, p_3\}$ admits a Cou relation by demographical rules, and $a_\Delta = a_2 - a_3$, $s_2 = s$. Since $p_2$ is already a partner, $p_3$ cannot be a new one. Hence, $p_2 \to p_3$ forms a non-Cou relation. Restrict the relations of types {P-C, Gp-Gc, Sib} in the stock of households to those which could be a Cou relation by demographical rules, which start with sex $s$ and have age-difference $a_\Delta$. This set together with the non-Cou relations forms the set of complementary events. Comparing these events with the real Cou relations in the stock of households (which start with sex $s$ and have age-difference $a_\Delta$) leads to the probability distribution $d(\mathrm{Cou}, s, a_\Delta)$ of Cou.
P-C-distribution. Again, let $s \in \{\text{male}, \text{female}\}$ and $a_\Delta \in \mathbb{Z}$ be arbitrary but fixed. As for the Cou-distribution, we compute the complementary probability by non-P-C relations defined as follows: Take a household $H$ with persons $P = \{p_1, p_2, \ldots\}$ such that $p_1 \to p_2 = 0$, the sub-household $\{p_1, p_2\}$ admits a P-C relation by demographical rules, and $a_\Delta = a_1 - a_2$, $s_1 = s$. Further assume that there exists a person $q$, $q \notin P$, with sex $s$ and relation $q \to p_2 = \text{P-C}$ (i.e. $p_2$ has at least a parent with sex $s$ in a separate household). Then $p_1 \to p_2$ forms a non-P-C relation. Restrict the relations of types {Gp-Gc, Sib} in the stock of households to those which could be a P-C relation by demographical rules, which start with sex $s$ and have age-difference $a_\Delta$. This set together with the non-P-C relations forms the set of complementary events. Comparing these events with the real P-C relations in the stock of households leads to the probability distribution $d(\text{P-C}, s, a_\Delta)$ of P-C.
Note that the result obtained that way depends heavily on the non-relations. Before one can use the distribution, one has to ensure that there are enough such non-relations. It may be necessary to shrink the set of (complementary) events, e.g. by restricting it to households with $n \leq 3$.
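The text does not spell out how the comparison of real relations with complementary events is turned into a probability. One natural reading, which is an assumption of this sketch and not confirmed by the source, is the share of real relations among all events for a fixed sex and age-difference. The function name and the counts below are hypothetical.

```python
def relation_probability(n_relations, n_complementary):
    """Share of real relations among real plus complementary events.

    Both counts refer to one fixed combination of relation type,
    sex s and age-difference a_delta.
    """
    total = n_relations + n_complementary
    if total == 0:
        raise ValueError("no events for this combination of sex and age-difference")
    return n_relations / total

# Hypothetical counts for one cell (s, a_delta) of the Cou-distribution.
d_cou = relation_probability(n_relations=90, n_complementary=10)
```

With few complementary events the estimate becomes unstable, which is the reason for the caution about the number of non-relations.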
Further rules. An imputed relation $p_1 \to p_2$ between two persons $p_1, p_2$ with sex $s_1, s_2$, respectively, has to fulfill rules like those presented in Table 1 (perhaps some of them with enlarged restrictions). Further rules for an imputed relation (see Table 4) include the variables parents and ages at registration.

Ordering relation. If a household allows us to impute a relation for which more than one type ($\neq 0$) is possible, or the household allows us to impute two or more relations, which one should be estimated first? The answer is crucial (in particular if $n > 2$), since an imputed relation affects the subsequent estimation procedure. Hence, we try to order the possible relations and types among themselves according to their probability, starting from the most probable. The easiest way to do that is to count the frequencies. Since we want to compare different types of relations and the number of P-C relations recorded by Statistics Austria is more than twice the number of Cou relations, we must normalize them. We take the whole stock of historicised relations as described in Section 2.3. Let $N(\mathrm{Cou}, s, a)$ be the number of all relations $p_1 \to p_2 = \mathrm{Cou}$, $s_1 = s$, $a = a_\Delta$ plus the number of all relations $p_1 \to p_2 = \mathrm{Cou}$, $s_2 = s$, $a = -a_\Delta$ for arbitrary but fixed $s \in \{\text{male}, \text{female}\}$, $a \in \mathbb{Z}$. Further, let $N(\text{P-C}, s_1, a_\Delta)$ be the number of all relations $p_1 \to p_2 = \text{P-C}$. Then the relative frequency distribution $f(r, s, a)$ of the relation $r \in \{\mathrm{Cou}, \text{P-C}\}$ with sex $s_1 = s \in \{\text{male}, \text{female}\}$ and age-difference $a_\Delta = a$ is defined as

$$f(r, s, a) = \frac{N(r, s, a)}{\sum_{a' \in \mathbb{Z}} N(r, s, a')}. \qquad (1)$$

Now assume that $p_1 \to p_2 = r$ can be either of type $r_1 \neq 0$ or $r_2 \neq 0$; then we define the preordering relation $r_1 \succeq r_2$ if and only if $f(r_1, s, a) \geq f(r_2, s, a)$.

There are $N = \sum_{1 \leq i < j \leq m} n_i n_j$ possibilities for choosing two persons $p_1, p_2 \in P$ who are not in the same component (ignoring the order of the components). These possibilities define a set of pairs $\{(p_{i1}, p_{i2}) \mid i = 1, \ldots, N\}$. Each of these pairs $(p_{i1}, p_{i2})$ can be viewed as a two-person sub-household of $H$ to which we can apply the procedure above (an example is shown in Figure 3). A further reduction of $N$ is possible if we delete all pairs $(p_{i1}, p_{i2})$ which already define a relation of type Gp-Gc, Gc-Gp or Sib. Let $s_i, s_j$ be the sex of the first involved person of $r_i, r_j$, respectively, and $a_{\Delta i}, a_{\Delta j}$ the age-difference of the corresponding pair. Then $(U, \succeq)$ is a totally preordered set, defined by $(r_i, u) \succeq (r_j, u')$ if and only if $f(u, s_i, a_{\Delta i}) \geq f(u', s_j, a_{\Delta j})$. Now we are ready to carry out the following procedure:
1. For a plausible household $H$ compute the set $U$.
2. If $U = \{\}$ then stop; otherwise take a (not necessarily unique) maximal element $(r, u) \in U$ and generate a uniformly distributed random variable $x$ between 0 and 1.
(a) If $x \leq d(u, s, a_\Delta)$, assign the type $u$ to $r$. This assignment generates a new household $\hat{H}$.
i. If $\hat{H}$ is plausible, then replace $H$ with $\hat{H}$ and repeat step 1.
ii. If $\hat{H}$ is implausible, then remove $r$ with type $u$ from $\hat{H}$, delete $(r, u)$ from $U$ and repeat step 2.

(since a one-person household obviously must be classified correctly), about 89% still match.
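The stepwise imputation procedure (steps 1 and 2 of the list above) can be sketched as a loop over the candidate set $U$. This is a sketch under stated assumptions: the callbacks `candidates`, `freq`, `prob` and `is_plausible` are hypothetical stand-ins for the computation of $U$, the frequency $f$, the distribution $d$ and the plausibility check, and the handling of a failed random draw (dropping the candidate) is an assumption, since that case is not described in the text.

```python
import random

def impute_relations(household, candidates, freq, prob, is_plausible,
                     rng=random.random):
    """Impute relations until no candidate is left.

    household:    dict mapping a pair of persons to a relation type
    candidates:   function household -> list of (pair, type, sex, age_diff);
                  it must not return pairs that already carry a relation
    freq, prob:   functions (type, sex, age_diff) -> float, playing the
                  roles of f and d in the text
    is_plausible: function household -> bool
    """
    household = dict(household)
    pool = candidates(household)                     # step 1: compute U
    while pool:                                      # step 2: stop if U is empty
        best = max(pool, key=lambda c: freq(c[1], c[2], c[3]))
        pair, rel_type, sex, age_diff = best
        if rng() <= prob(rel_type, sex, age_diff):   # step 2(a)
            trial = dict(household)
            trial[pair] = rel_type
            if is_plausible(trial):                  # step 2(a)i
                household = trial
                pool = candidates(household)
                continue
        # Step 2(a)ii, and (assumed) also the case where the draw fails.
        pool.remove(best)
    return household
```

Passing the random number generator as a parameter keeps the sketch testable; ties between maximal elements are resolved by taking the first one, which is one admissible choice of a maximal element.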
The classification rates for household types are listed in Table 5.

Closing remarks
As we have seen in Section 4, most of the families are verified by administrative data sources. Of course, it is unrealistic to have the measure $\mu$ equal to 1 for all families, but one can try to improve the present quota. Obviously, this will happen in the future since the relations are archived. One can speed up this process by involving other data sources, for example relations derived from a new central civil status register, which will be established at the ministry of the interior in the year 2014. Other issues arise in some optional topics, such as foster children, stepchildren, reconstituted families etc. However, in a register-based census these topics are not easy to handle.

2. Check whether a new relation $\neq 0$ in $R$ has been derived by step 1. If not, stop. If so, repeat step 1.
The following classification of private households by type is used in the Austrian Census 2011.
− Married couples without resident children
− Married couples with at least one resident child under 25
− Married couples, youngest resident son/daughter 25 or older
− Consensual union couples without resident children
− Consensual union couples with at least one resident child under 25
− Consensual union couples, youngest resident son/daughter 25 or older
− Lone father households with at least one resident child under 25
− Lone father households, youngest resident son/daughter 25 or older
− Lone mother households with at least one resident child under 25
− Lone mother households, youngest resident son/daughter 25 or older
− Two-or-more-family households
− One-person households
− Multi-person households (non-family)
Furthermore, the classification can be enlarged for one-family households, if one distinguishes such households with or without non-family members (see United Nations 2006).

Figure 3: Possible relations in a certain household.
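The possible relations are the pairs of persons taken from different components of the household, and their number is $N = \sum_{1 \leq i < j \leq m} n_i n_j$. A minimal sketch that enumerates these pairs, with a hypothetical household of three components and a hypothetical function name:

```python
from itertools import combinations, product

def cross_component_pairs(components):
    """All pairs of persons taken from two different components."""
    pairs = []
    for c_i, c_j in combinations(components, 2):
        pairs.extend(product(c_i, c_j))
    return pairs

# Hypothetical household with component sizes 2, 1 and 1.
components = [["p1", "p2"], ["p3"], ["p4"]]
pairs = cross_component_pairs(components)

# The closed formula: sum of n_i * n_j over all pairs of components i < j.
n_pairs = sum(len(a) * len(b) for a, b in combinations(components, 2))
```

Pairs inside one component, such as (p1, p2), are never produced, because their relation is already determined within that component.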
The kind of co-insurance implies the type of relationship. The CAR contains information about the parent-child relations for children up to 18 or, if they are students, up to 27 years of age. Under certain conditions (e.g. if you get child allowance) you can request a tax allowance from the federal ministry of finance. Parts of these records can be used to derive relationships.
A relation from a person $p_1$ to a person $p_2$ is denoted by $p_1 \to p_2$. The opposite relation is denoted by $p_2 \to p_1$. The opposite of the directed relation P-C resp. Gp-Gc is denoted by C-P resp. Gc-Gp.

b The persons $p_1$, $p_3$ have to fulfill conditions on sex, age and marital status, respectively, as in Table 1.
c The symbol ∅ means that there is no admissible value.

Table 4: Further rules for imputed relations.
Example. Let $s_1$ = male, $s_2$ = female, $a_1 \geq 37$, $a_2 = 20$ and $m_1, m_2$ = never married. Hence, if $a_\Delta = 17$, we try to estimate a Cou relation first, whereas if $a_\Delta \geq 18$ we try to estimate a P-C relation first.

To estimate a relation $p_1 \to p_2 = r$ in a household $H = (\{p_1, p_2\}, \{0\})$, a uniformly distributed random variable $x$ between 0 and 1 is produced and assigned to $(p_1, p_2)$. If the type in question is $r_1$ and $x \leq d(r_1, s, a_\Delta)$, then we accept $r_1$. If not, we check whether we can estimate another type $r_2$, $r_2 \preceq r_1$, for the relation $r$. Now we are able to extend this procedure to $n \geq 3$ by estimating relations stepwise.

Decomposition of H. Let $H = (P, R)$ be a fixed plausible household with $n > 3$ persons. Then $H$ is a disjoint union of the strong-connected components $C_1, \ldots, C_m$ of $H$ (i.e. the maximal strong-connected sub-households of $H$), $H = \bigcup_i C_i$. Let $n_i$ be the number of persons in $C_i$, $i = 1, \ldots, m$. Obviously, there are $N =$

Table 5: Classification rates by household types.

Type of household (1)                                                    Correct classification rate
Married couples without resident children                                96%
Married couples with at least one resident child under 25                92%
Married couples, youngest resident son/daughter 25 or older              87%
Consensual union couples without resident children                       85%
Consensual union couples with at least one resident child under 25       96%
Consensual union couples, youngest resident son/daughter 25 or older     78%
Lone father households with at least one resident child under 25         70%
Lone father households, youngest resident son/daughter 25 or older       54%
Lone mother households with at least one resident child under 25         92%
Lone mother households, youngest resident son/daughter 25 or older       90%
Multi-person households (non-family)                                     57%

(1) The table does not include the type one-person households, since in such households there is no estimated relation and all of them must be correctly classified. Further, the type two-or-more-family households is omitted for the following reason: to prevent an overestimation of such households, we limited the number of families generated by estimated relations to one. So a classification rate cannot be computed for this household type.