Distances Based on the Perimeter of the Risk Set of a Testing Problem

At the core of this paper are a simple geometric object, namely the risk set of a statistical testing problem, on the one hand and f-divergences, introduced by Csiszár (1963), on the other hand. f-divergences are measures for the hardness of a testing problem depending on a convex real-valued function f on the interval [0, ∞). The choice of this parameter f can be adjusted so as to match the needs of specific applications. One of these adjustments of the parameter f is exemplified in Section 3 of this paper. There it is illustrated that the appropriate choice of f for the construction of least favourable distributions in robust statistics is the convex function f(u) = √(1 + u²) − (1 + u)/√2, yielding the perimeter of the risk set of a testing problem. After presenting the definition, mentioning the basic properties of a risk set and giving the integral-geometric representation of f-divergences, the paper will focus on the perimeter of the risk set. All members of the class of f-divergences of perimeter type introduced and investigated in Österreicher and Vajda (2003) and Vajda (2009) turn out to be metric divergences corresponding to a class of entropies introduced by Arimoto (1971). Without essential loss of insight we restrict ourselves to discrete probability distributions and note that the extension to the general case relies strongly on the Lebesgue-Radon-Nikodym Theorem.

Dedicated to the Memory of Igor Vajda (1942-2010)

Austrian Journal of Statistics, Vol. 42 (2013), No. 1, 3-19


Introduction
Basic for this paper is a (simple versus simple) testing problem (P, Q), which is a pair of probability distributions P and Q defined on a set Ω = {x₁, x₂, . . .} of at least two elements.
In Section 2 the central entity of this paper, namely the risk set of a testing problem, and its properties are presented: Let A ⊆ Ω be a (nonrandomized) test. Then the convex hull of all pairs (P(A), Q(A^c)), A ⊆ Ω, of the probabilities P(A) and Q(A^c) of type I and type II error satisfying P(A) + Q(A^c) ≤ 1 is the risk set R(P, Q) of the testing problem (P, Q). Expressed colloquially, its essence may be summarized as follows: the 'bulkier' its risk set the 'easier' the testing problem. Section 3 is devoted to f-divergences. Subsection 3.3 contains their precise definition and their basic properties.
Motivation 1: The most widely used measure of the deviation of two probability distributions in statistics is Pearson's χ²-divergence

χ²(Q, P) = Σ_{x∈Ω} (q(x) − p(x))² / p(x) .

Another well-known measure of deviation, originally designed for applications in cryptanalysis and later used both in information theory and statistics, is the I- or Kullback-Leibler divergence

I(Q, P) = Σ_{x∈Ω} q(x) log(q(x)/p(x)) .

Another measure of deviation is the total variation distance

||Q − P||/2 = (1/2) Σ_{x∈Ω} |q(x) − p(x)| .

These and many other measures of deviation of two probability distributions are special cases of f-divergences, defined in terms of a convex function f : [0, ∞) → ℝ continuous at 0 and introduced by Csiszár (1963). So the χ²-divergence, the I-divergence and the total variation distance are the f-divergences with the convex functions f(u) = (u − 1)², f(u) = u log u and f(u) = |u − 1|/2, respectively. Subsections 3.1 and 3.2 deliver the integral-geometric approach to f-divergences, which is based on the risk set of a testing problem: the 'Representation Theorem', the main result of this subsection, states that every f-divergence is a certain way of measuring the 'bulkiness' of the risk set. The most natural measure of its 'bulkiness' is perhaps its perimeter: it turns out that the latter is the f-divergence given by the convex function

f(u) = √(1 + u²) − (1 + u)/√2 .  (1)

More generally, the 'parameter' f of an f-divergence is nothing but a certain function for the weights of the breadths of the corresponding risk set measured for all different directions. By the way, the area of the risk set (the well-known Gini coefficient) is not an f-divergence. Subsection 3.4 is entitled 'Metric f-Divergences'.
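For concreteness, the three divergences above can be computed with one small routine via the standard Csiszár recipe I_f(Q, P) = Σ_{x∈Ω} p(x) f(q(x)/p(x)), together with the convex functions just listed. The following sketch is our illustration (not code from the paper) and, for simplicity, assumes p(x) > 0 for all x; P and Q are the fair and biased tetrahedron distributions used later in the paper.

```python
import math

def f_divergence(q, p, f):
    """I_f(Q,P) = sum_x p(x) * f(q(x)/p(x)); assumes p(x) > 0 everywhere."""
    return sum(px * f(qx / px) for qx, px in zip(q, p))

chi2 = lambda u: (u - 1) ** 2                     # Pearson chi^2
kl = lambda u: u * math.log(u) if u > 0 else 0.0  # I-divergence, convention 0 log 0 := 0
tv = lambda u: abs(u - 1) / 2                     # total variation distance

P = [1/4, 1/4, 1/4, 1/4]
Q = [5/8, 1/4, 1/8, 0]

# the same quantities from their direct definitions, for comparison
chi2_direct = sum((qx - px) ** 2 / px for qx, px in zip(Q, P))
tv_direct = sum(abs(qx - px) for qx, px in zip(Q, P)) / 2

print(f_divergence(Q, P, chi2), chi2_direct)  # both 0.875
print(f_divergence(Q, P, tv), tv_direct)      # both 0.375
```

Note that the routine depends on f only through its values, so any further f-divergence discussed below can be obtained by swapping in the corresponding convex function.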
Motivation 2: Pearson's χ²-divergence χ²(Q, P) is obviously not apt to define a metric divergence. However, the square root of its symmetrized version is a metric. This divergence, which is defined in terms of the convex function f(u) = (u − 1)²/(2(1 + u)) and which goes back to Sanghvi (1953), is studied in detail by Puri and Vincze (1988). Thus, from the mathematical point of view it is very natural to ask which properties of a convex function f are sufficient for an f-divergence to be a metric divergence. Theorems 4 and 5 answer this question.
Remark 1: Metric divergences do not only fulfil the symmetry property, i.e. that f is *-self-conjugate, but also the property f(0) + f*(0) < ∞. The latter is crucial for
• establishing bounds for the error probabilities of a sequence (P_n, Q_n), n ∈ ℕ, of testing problems (i.e. in case of n iid observations with distribution P and Q, respectively) and
• the characterization of an entirely separated sequence (P_n, Q_n), n ∈ ℕ, of probability distributions (cf. e.g. Chapter 11, Various statistical applications of f-divergence, in Vajda, 1989).
Section 4 is devoted to robust testing: We are given a simple versus composite testing problem (P, Q), where Q = U(Q, ε) is the set of all probability distributions with total variation distance ≤ ε from a given probability distribution Q, and P ∉ Q. A least favourable distribution is a distribution Q* ∈ Q which is 'closest' to P. In early papers on robust testing least favourable distributions were characterized by f-divergences. In this section we show how to construct a least favourable distribution Q*. This is done in geometric terms: We construct the intersection R(P, Q) of the risk sets R(P, Q'), Q' ∈ Q, and select the element Q* ∈ Q which satisfies R(P, Q*) = R(P, Q). Therefore, in order to construct least favourable distributions, the appropriate choice of the convex function f of the f-divergence is the one which gives rise to the perimeter of the risk set of the testing problem, i.e. the one given by (1). f-divergences are used not only in several areas of statistics, but furthermore in proving limit theorems in probability theory, in analyzing the limiting behavior of Markov chains, in information theory and in quantum physics.
Section 5 is devoted to the class of f-divergences of perimeter type, introduced and studied in Österreicher and Vajda (2003) and Vajda (2009). It is based on the class of entropies due to Arimoto (1971) and contains, next to the f-divergence given by (1), the total variation distance, a symmetrized version of the I-divergence, the squared Hellinger distance (with f(u) = (√u − 1)²) and the squared Puri-Vincze distance. All f-divergences of this class are metric divergences.

Risk Sets
Let Ω = {x₁, x₂, . . .} be a set with at least two elements, P(Ω) the set of all subsets of Ω and P the set of all probability distributions P = (p(x) : x ∈ Ω) on Ω.
A pair (P, Q) ∈ P² of probability distributions is called a (simple versus simple) testing problem. A subset A ⊂ Ω is called a (simple) test. It is associated with the following decision rule: one decides in favour of the hypothesis Q if x ∈ A is observed and in favour of P if x ∈ A^c = Ω\A is observed.
Then P(A) and Q(A^c) are the probability of type I error (the probability of a decision in favour of Q although P is true) and the probability of type II error (the probability of a decision in favour of P although Q is true), respectively.
Two probability distributions P and Q are called orthogonal (P⊥Q) if there exists a test A ⊂ Ω such that P(A) = Q(A^c) = 0. (In this extreme case only one observation is needed to decide between P and Q and the probabilities of committing both errors vanish.) A testing problem (P, Q) ∈ P² is called least informative if P = Q and most informative if P⊥Q.
Let 0 ≤ π < 1 and let (π, 1 − π) be a prior distribution on the set {P, Q} ⊂ P associated with the testing problem (P, Q). Then the quantity

π P(A) + (1 − π) Q(A^c)

is called the Bayes risk of the test A with respect to the prior distribution (π, 1 − π). Since the Bayes risk enables us to order the pairs (P(A), Q(A^c)), A ∈ P(Ω), of error probabilities, it is straightforward to ask for tests which provide the minimal Bayes risk. In fact, with t = π/(1 − π), as can easily be checked it holds that

t P(A) + Q(A^c) = Σ_{x∈Ω} min(q(x), t p(x)) + Σ_{x∈A} (t p(x) − q(x))⁺ + Σ_{x∈A^c} (q(x) − t p(x))⁺ ,

where the two latter terms are nonnegative and vanish iff A_t ⊆ A ⊆ A_t⁺. In order to summarize, let t = π/(1 − π), A_t = {q > t p}, A_t⁺ = {q ≥ t p} and let

b_t(Q, P) = Σ_{x∈Ω} min(q(x), t p(x))

be the (1 + t)-multiple of the minimal Bayes risk with respect to the prior distribution (t/(1+t), 1/(1+t)). Then b_t(Q, P) = t P(A_t) + Q(A_t^c).

Definition 1: Let (P, Q) ∈ P² be a testing problem. Then the set

R(P, Q) = co{(P(A), Q(A^c)) : A ∈ P(Ω), P(A) + Q(A^c) ≤ 1}

is called the risk set of the testing problem (P, Q), whereby co stands for 'the convex hull of'. The geometric object of the risk set R(P, Q) provides a qualitative measure for the deviation of P and Q. In fact, the family of risk sets defines a uniform structure on the set P (cf. Linhart and Österreicher, 1985).

(R1) It holds that D ⊆ R(P, Q) ⊆ T, where D is the diagonal segment joining (0, 1) and (1, 0) and T = {(α, β) ∈ [0, 1]² : α + β ≤ 1}, with equality iff P = Q and P⊥Q, respectively.
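The identity above can be probed by brute force: for each t, the quantity b_t(Q, P) = Σ_x min(q(x), t p(x)) equals the minimum of t P(A) + Q(A^c) over all 2^m tests, and the minimum is attained by the test A_t = {q > t p}. The sketch below is our illustration (function names are ours, not the paper's):

```python
from itertools import chain, combinations

def b(t, q, p):
    """b_t(Q,P) = sum_x min(q(x), t*p(x)), the (1+t)-multiple of the minimal Bayes risk."""
    return sum(min(qx, t * px) for qx, px in zip(q, p))

def weighted_risk(A, t, q, p):
    """t*P(A) + Q(A^c) for a nonrandomized test A, given as a set of indices."""
    return t * sum(p[i] for i in A) + sum(q[i] for i in range(len(q)) if i not in A)

P = [1/4, 1/4, 1/4, 1/4]
Q = [5/8, 1/4, 1/8, 0]
t = 0.75

A_t = {i for i in range(len(P)) if Q[i] > t * P[i]}   # the Bayes test {q > t p}
all_tests = chain.from_iterable(combinations(range(len(P)), r) for r in range(len(P) + 1))
best = min(weighted_risk(set(A), t, Q, P) for A in all_tests)

print(best, b(t, Q, P), weighted_risk(A_t, t, Q, P))  # all three agree
```

The exhaustive minimum over all subsets is of course only feasible for small Ω; the point is that it coincides with the closed form b_t(Q, P).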

Properties of Risk Sets
(R2) Let t ≥ 0 and b_t(Q, P) be the (1 + t)-multiple of the minimal Bayes risk with respect to the prior distribution (t/(1+t), 1/(1+t)). Then the risk set R(P, Q) of a testing problem is determined by its family of supporting lines from below, namely

ℓ_t = {(α, β) ∈ ℝ² : t α + β = b_t(Q, P)} , t ≥ 0 .

Consequence of (R2): Let (P, Q) and (P̃, Q̃) be two testing problems. Then R(P, Q) ⊆ R(P̃, Q̃) iff b_t(Q, P) ≥ b_t(Q̃, P̃) for all t ≥ 0.

Simple Example (Testing a fair tetrahedron versus a biased one): Although the number of simple tests for a set Ω with m elements is |P(Ω)| = 2^m, we need only m + 1 pairs (P(A), Q(A^c)), A ∈ P(Ω), in order to determine the risk set R(P, Q) economically. It is advisable to proceed as follows: Order the set Ω so that the likelihood ratios are decreasing, i.e.

q(x₁)/p(x₁) ≥ q(x₂)/p(x₂) ≥ · · · ≥ q(x_m)/p(x_m) ,

take the tests A₀ = ∅ and A_i = {x₁, . . . , x_i}, i ∈ {1, . . . , m}, form the set S = {(P(A_i), Q(A_i^c)) : i ∈ {0, 1, . . . , m}} of the pairs of error probabilities and form the convex hull co(S) of this set. Then co(S) = R(P, Q).
For our example, with P = (1/4, 1/4, 1/4, 1/4) the fair tetrahedron and Q = (5/8, 1/4, 1/8, 0) the biased one, the tests A_i and the corresponding pairs (P(A_i), Q(A_i^c)) of error probabilities are given in the following table.

  i    A_i              P(A_i)    Q(A_i^c)
  0    ∅                0         1
  1    {x₁}             1/4       3/8
  2    {x₁, x₂}         1/2       1/8
  3    {x₁, x₂, x₃}     3/4       0
  4    Ω                1         0
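The recipe above (sort Ω by decreasing likelihood ratio, then accumulate) is easy to mechanize. The sketch below is our illustration of it; the function name is ours, and ties in q/p may be ordered arbitrarily without changing the convex hull:

```python
def lower_boundary(p, q):
    """The m+1 vertices (P(A_i), Q(A_i^c)) of the lower boundary of R(P,Q),
    with A_i the first i elements of Omega sorted by decreasing q/p
    (elements with p(x) = 0 are placed first, i.e. ratio = infinity)."""
    ratio = lambda i: q[i] / p[i] if p[i] > 0 else float("inf")
    order = sorted(range(len(p)), key=ratio, reverse=True)
    points, PA, QAc = [(0.0, 1.0)], 0.0, 1.0        # A_0 = empty set gives (0, 1)
    for i in order:
        PA, QAc = PA + p[i], QAc - q[i]
        points.append((PA, QAc))
    return points

P = [1/4, 1/4, 1/4, 1/4]   # fair tetrahedron
Q = [5/8, 1/4, 1/8, 0]     # biased one
print(lower_boundary(P, Q))
# -> [(0.0, 1.0), (0.25, 0.375), (0.5, 0.125), (0.75, 0.0), (1.0, 0.0)]
```

The risk set is then the convex hull of these m + 1 points together with the diagonal segment above them.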
Remark 2: For the special case P = (1/m, . . . , 1/m) and Q = (q₁, . . . , q_m) such that q₁ > q₂ > · · · > q_m, the lower boundary of the risk set R(P, Q) is the so-called Lorenz curve. It was already used by Lorenz (1905) in order to measure the inequality of the distribution of wealth within a given population. The translation of the following quotation from Lorenz' paper into our context describes exactly the purpose of the risk set.
"We wish to be able to say at which point a community is placed between the two extremes, equality on the one hand, and the ownership of all wealth by one individual on the other."


f-Divergences

Geometric Approach
In order to define a quantity for the 'hardness' of a testing problem (P, Q) we follow the qualitative step, which relates the 'hardness' of a testing problem (P, Q) to the 'bulkiness' of the corresponding risk set R(P, Q), by a first quantitative step.
To this end let b_t(Q, P) be the (1 + t)-multiple of the minimal Bayes risk with respect to (t/(1+t), 1/(1+t)) of the testing problem (P, Q) and let b_t(P, P) = min(1, t) be the corresponding quantity for the least informative testing problem (P, P). Then the differences

min(1, t) − b_t(Q, P) , t ≥ 0 ,

compare the 'bulkiness' of the risk set R(P, Q) with that of the risk set R(P, P) = D of the least informative testing problem. The parameters t ≥ 0 are the absolute values of the slopes of the supporting lines of the risk set from below. In a second quantitative step weights for the parameters t ≥ 0 are assigned in terms of a suitable monotone function F: the resulting quantity

∫_[0,∞) [min(1, t) − b_t(Q, P)] dF(t)

provides an essential extension of the above family of measures of the 'bulkiness' of the risk set. Due to the richness of the class of parameters F these weighted measures can be adjusted so as to match a given type of application.

The Perimeter of the Risk Set
In this subsection we are going to describe an interesting special case, related to the well-known fact from integral geometry that the perimeter of a bounded convex subset of ℝ² is the integral of its breadths. Since max(1, t) − b_t(Q, P) is the vertical part of the breadth of R(P, Q) in direction of the vector (t/(1+t), 1/(1+t)) of the prior distribution,

[max(1, t) − b_t(Q, P)] / √(1 + t²)

is its breadth. Since the breadth of the risk set with respect to φ ∈ [π/2, π) is cos(π − φ) + sin(π − φ) and ∫_{π/2}^{π} [cos(π − φ) + sin(π − φ)] dφ = 2, the perimeter Per(R(P, Q)) of the risk set is

Per(R(P, Q)) = 2 + ∫_0^∞ [max(1, t) − b_t(Q, P)] (1 + t²)^(−3/2) dt ,

whereby, by virtue of cos(φ(t)) dφ(t)/dt = (1 + t²)^(−3/2), the density 1_[0,π/2)(φ) of uniform weight is transformed to the density (1 + t²)^(−3/2) in the parametrization by t ∈ [0, ∞). Since the perimeter of the risk set R(P, P) = D of the least informative testing problem is obviously Per(R(P, P)) = 2√2, the difference

Per(R(P, Q)) − Per(R(P, P)) = ∫_0^∞ [min(1, t) − b_t(Q, P)] (1 + t²)^(−3/2) dt

is the special case of our family of measures given by the density (1 + t²)^(−3/2). The above approach to define a family of measures of the 'hardness' of a testing problem, which stresses modelling, relies on the following representation theorem for so-called f-divergences I_f(Q, P), given by Feldman and Österreicher (1981). In this setting the weight function F introduced above is the right-hand side derivative D₊f of a continuous convex function f on the interval [0, ∞).
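With discrete P and Q these formulas collapse to elementary sums: the lower boundary of R(P, Q) is a polygon whose segments have lengths √(p(x)² + q(x)²), so Per(R(P, Q)) = √2 + Σ_x √(p(x)² + q(x)²), and the difference Per(R(P, Q)) − Per(R(P, P)) coincides with I_f(Q, P) for the f of (1). The following sketch is our own numerical consistency check (assuming p(x) > 0 throughout), not code from the paper:

```python
import math

def I_f(q, p, f):
    """I_f(Q,P) = sum_x p(x) * f(q(x)/p(x)); assumes p(x) > 0 everywhere."""
    return sum(px * f(qx / px) for qx, px in zip(q, p))

def perimeter_divergence(q, p):
    """Per(R(P,Q)) - Per(R(P,P)) = sum_x sqrt(p(x)^2 + q(x)^2) - sqrt(2)."""
    return sum(math.hypot(px, qx) for qx, px in zip(q, p)) - math.sqrt(2)

f_per = lambda u: math.hypot(1.0, u) - (1.0 + u) / math.sqrt(2)   # the f of (1)

P = [1/4, 1/4, 1/4, 1/4]
Q = [5/8, 1/4, 1/8, 0]

print(perimeter_divergence(Q, P), I_f(Q, P, f_per))  # the two values agree
print(perimeter_divergence(P, P))                    # 0.0 for the least informative problem
```

The agreement follows algebraically from Σ_x p √(1 + (q/p)²) = Σ_x √(p² + q²) and Σ_x p (1 + q/p)/√2 = √2.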
Representation Theorem: Let f : [0, ∞) → ℝ be a convex function, continuous at 0 and satisfying f(1) = 0, and let D₊f denote its right-hand side derivative. Then for every testing problem (P, Q) ∈ P²

I_f(Q, P) = ∫_[0,∞) [min(1, t) − b_t(Q, P)] dD₊f(t) .

In the following section we will present the original definition of f-divergences by Csiszár (1963), a number of examples and the basic properties.

Range of Values Theorem
For all (P, Q) ∈ P² it holds that

f(1) ≤ I_f(Q, P) ≤ f(0) + f*(0) .

In the first inequality, equality holds if Q = P; it holds only if Q = P provided that (i) f is strictly convex at 1. In the second inequality, equality holds if Q⊥P; it holds only if Q⊥P provided that f(0) + f*(0) < ∞.

Characterization Theorem (Csiszár, 1974): Given a mapping I : P² → (−∞, ∞], the following two statements are equivalent:
(*) I is an f-divergence, i.e. there exists an f ∈ F₀ such that I(Q, P) = I_f(Q, P) for all (P, Q) ∈ P².
(**) I satisfies the following three properties:
(a) I(Q, P) is invariant under permutations of Ω.
(b) Let A = (A₁, A₂, . . .) be a partition of Ω and let P|_A = (P(A₁), P(A₂), . . .) and Q|_A = (Q(A₁), Q(A₂), . . .) be the restrictions of the probability distributions P and Q to A. Then I(Q|_A, P|_A) ≤ I(Q, P).
(c) Let α ∈ [0, 1] and let P₁, P₂ and Q₁, Q₂ be probability distributions on Ω. Then I(αQ₁ + (1 − α)Q₂, αP₁ + (1 − α)P₂) ≤ α I(Q₁, P₁) + (1 − α) I(Q₂, P₂).

Metric f -Divergences
Let us now concentrate on those (further) properties of the convex function f which allow for metric divergences. As we know already, I_f(Q, P) fulfils the basic property (M1) of a metric divergence, namely

I_f(Q, P) ≥ 0, with equality iff Q = P ,

provided that (i) f is strictly convex at 1.
In addition, I_f(Q, P) is symmetric, i.e. satisfies (M2) I_f(Q, P) = I_f(P, Q), provided that (ii) f is *-self-conjugate, i.e. f(u) = f*(u) = u f(1/u) for all u ∈ (0, ∞). It turns out that, in addition to the rather natural conditions (i) and (ii), the condition (iii) f(0) + f*(0) < ∞, which is used to characterize Q⊥P, is crucial for metric divergences. However, since it cannot be expected in general that an f-divergence fulfils the triangle inequality, we have to look for suitable powers which do.
Of the following two theorems, given in Kafka, Österreicher, and Vincze (1991), Theorem 4 offers a class (iii, α), α ∈ (0, 1], of conditions which are sufficient to guarantee that the power [I_f(Q, P)]^α is a distance on P. Theorem 5 determines, in dependence on the behaviour of f in the neighbourhood of 1 and of g(u) = f(0)(1 + u) − f(u) in the neighbourhood of 0, the maximal α providing a distance.
Theorem 5: Let (i) and (ii) hold true and let α₀ ∈ (0, 1] be the maximal α for which (iii, α) is satisfied. Then the following statement concerning α₀ holds if for some k₀, k₁, c₀, c₁ ∈ (0, ∞)

Finally we present a version of the refinement of the Range of Values Theorem which matches the assumptions (i), (ii) and (iii) that are necessary to allow for metric divergences.
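As an illustration of what Theorems 4 and 5 deliver, consider the Puri-Vincze divergence of Motivation 2, f(u) = (u − 1)²/(2(1 + u)), for which I_f(Q, P) = Σ_x (q(x) − p(x))²/(2(p(x) + q(x))) and the power α = 1/2 yields a distance. The brute-force check below is our sketch (a numerical probe of the triangle inequality on random distributions, not a proof):

```python
import itertools
import math
import random

def puri_vincze(q, p):
    """I_f with f(u) = (u-1)^2/(2(1+u)), i.e. sum_x (q-p)^2 / (2(p+q))."""
    return sum((qx - px) ** 2 / (2 * (px + qx)) for qx, px in zip(q, p) if px + qx > 0)

def random_distribution(n, rng):
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
dists = [random_distribution(3, rng) for _ in range(15)]
d = lambda a, b: math.sqrt(puri_vincze(b, a))

# triangle inequality for the square root, checked on all ordered triples
ok = all(d(a, b) + d(b, c) >= d(a, c) - 1e-12
         for a, b, c in itertools.permutations(dists, 3))
print(ok)  # no violation found
```

Symmetry (M2) holds here exactly, since (q − p)²/(2(p + q)) is unchanged when p and q are swapped.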
Remark 5: Note that this theorem implies that any metric defined in terms of an f-divergence is equivalent to the total variation distance.

Huber and Strassen (1973) proved the existence of least favourable pairs of distributions for composite versus composite testing problems under the assumption that both hypotheses are majorized by two-alternating capacities, and characterized them in terms of f-divergences with strictly convex functions f. The author restated the definition of least favourable pairs in terms of risk sets and demonstrated (Österreicher, 1982) that their perimeter can be used to construct least favourable pairs. For further references in this context see e.g. Österreicher (1983).

Construction of Least Favourable Distributions
For an application of the perimeter of the risk set for goodness of fit tests see Reschenhofer and Bomze (1991).
Let

R(P, Q) = ⋂ {R(P, Q') : Q' ∈ Q}

be the risk set of a simple versus composite testing problem (P, Q), which is a pair of an element P and a nontrivial subset Q of P. We will illustrate the construction of a least favourable distribution Q* ∈ Q for the simple case of a total variation neighbourhood.
Theorem 7: Let P, Q ∈ P and let Q = U(Q, ε), ε ∈ (0, 1), be a total variation neighbourhood of Q which does not contain P. Let furthermore R(P, Q) + (0, ε) be the risk set of the simple versus simple testing problem (P, Q) shifted upwards by the amount ε, and let finally t̲ < 1 < t̄ be the absolute values of the slopes of the supporting lines onto R(P, Q) + (0, ε) through the points (1, 0) and (0, 1), respectively. Then the least favourable distribution Q* ∈ Q for (P, U(Q, ε)) is given by the censored version

q*(x) = max(t̲ · p(x), min(q(x), t̄ · p(x)))

of the density q.
Simple Example (Continuation): In order to illustrate Theorem 7, let us continue our simple example from Section 2 by replacing the distribution Q by the total variation neighbourhood Q = U(Q, 1/8). When comparing the distribution Q in the center of the neighbourhood with the least favourable distribution Q* ∈ Q,

Q  = (5/8, 1/4, 1/8, 0)
Q* = (4/8, 1/4, 1/8, 1/8) ,

notice that the probability 1/8 is shifted from the most probable element to the least probable one.
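Theorem 7 leaves the two censoring thresholds implicit (they are slopes of supporting lines). One way to pin them down numerically, which is our reading and is consistent with the worked example above, is to choose the upper threshold so that the probability mass cut off above, Σ_x (q(x) − t̄ p(x))⁺, equals ε, and the lower threshold so that the mass filled in below, Σ_x (t̲ p(x) − q(x))⁺, equals ε; then Σ_x q*(x) = 1 automatically. A bisection sketch with hypothetical function names:

```python
def solve_increasing(g, lo, hi, tol=1e-12):
    """Bisection root-finding for an increasing function g with g(lo) < 0 < g(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def least_favourable(p, q, eps):
    """Censored density q*(x) = max(t_low*p, min(q, t_up*p)) as in Theorem 7,
    with thresholds chosen so that cut-off mass = filled-in mass = eps."""
    cut = lambda t_up: sum(max(qx - t_up * px, 0.0) for qx, px in zip(q, p))
    fill = lambda t_low: sum(max(t_low * px - qx, 0.0) for qx, px in zip(q, p))
    t_up = solve_increasing(lambda t: eps - cut(t), 0.0, 1.0 / min(x for x in p if x > 0))
    t_low = solve_increasing(lambda t: fill(t) - eps, 0.0, 1.0)
    return [max(t_low * px, min(qx, t_up * px)) for qx, px in zip(q, p)], t_low, t_up

P = [1/4, 1/4, 1/4, 1/4]
Q = [5/8, 1/4, 1/8, 0]
q_star, t_low, t_up = least_favourable(P, Q, 1/8)
print([round(x, 6) for x in q_star])    # -> [0.5, 0.25, 0.125, 0.125]
print(round(t_low, 6), round(t_up, 6))  # -> 0.5 2.0
```

For the tetrahedron example this reproduces Q* = (4/8, 1/4, 1/8, 1/8), with thresholds 1/2 and 2 playing the roles of the two slopes.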
If the distribution Q of income (with total amount 1) of a population of n individuals has to be redistributed so that the inequality in income is minimized under the constraint that the portion of income of no group of the population is cut or raised by more than ε, one has to proceed as follows: If a person's income exceeds a certain amount t̄/n, her or his income has to be cut to this bound. The total amount ε of income collected in that way must be allotted to those persons whose income is smaller than a certain lower bound t̲/n, so that every person is guaranteed the minimal income t̲/n.
The principle of income transfer was first clearly described by Dalton (1920) as follows: "If there are only two income receivers and a transfer of income takes place from the richer to the poorer, inequality is diminished. There is, indeed, an obvious limiting condition. The transfer must not be so large as to more than reverse the relative position of the two income receivers, and it will produce its maximum result, that is to say, create equality, when it is equal to half the difference between the two incomes. And we may safely go further and say that, however great the number of income receivers and whatever the amount of their incomes, any transfer between two of them or, in general, any series of such transfers, subject to the above condition, will diminish inequality. It is possible that, in comparing two distributions, in which both the total income and the number of income receivers are the same, we may see that one might be able to be evolved from the other by means of a series of transfers of this kind. In such a case we would say that the inequality of one was less than that of the other."

Divergences of Perimeter-Type
If both the arc length of the lower boundary of the risk set and the diagonal D are measured in terms of the l_p-norm in ℝ², then the ordinary case (p = 2) can be extended to the