Minimum distance index for BSS: generalization, interpretation and asymptotics

We consider complex valued linear blind source separation, where the signal dimension might be smaller than the dimension of the observable data vector. In order to measure the success of the signal separation, we propose an extension of the minimum distance index and establish its properties. Interpretations for the index are derived through connections to signal-to-noise ratios and correlations. The interpretations are novel also for the original real valued case. In addition, we consider the asymptotic behavior of the extended minimum distance index. This paper is an invited extended version of the paper presented at the CDAM 2019 conference.


Introduction
In several application areas, including, e.g., signal processing, economics and biomedical applications, it is assumed that one observes vectors that are linear mixtures of latent source variables. See, e.g., Comon and Jutten (2010) for an introduction. The corresponding model is referred to as the linear blind source separation (BSS) model. Usually, the goal in BSS is to recover the latent unobservable variables. That is, the objective is to estimate the corresponding mixing (or unmixing) matrix.
Several algorithms for solving the linear BSS problem have been presented in the literature, see for example Ilmonen and Paindaveine (2011); Matteson and Tsay (2017); Miettinen, Illner, Nordhausen, Oja, Taskinen, and Theis (2016); Nordhausen (2014); Risk, Matteson, Ruppert, Eloyan, and Caffo (2014); Virta, Li, Nordhausen, and Oja (2017). As theoretical results, including asymptotic distributions, are often not presented, different estimators are compared in simulation studies. To enable comparisons, performance indices are needed. Popular indices considered in the literature include the Amari index, the interference to signal ratio (ISR), the mean square error (MSE) and the minimum distance index (MDI), see Nordhausen, Ollila, and Oja (2011) for a comparative study. The MDI has the advantage of being affine invariant. Moreover, its asymptotic behavior is known in the case of real valued square mixing matrices, see Ilmonen, Nordhausen, Oja, and Ollila (2010, 2012). In this paper, our focus is on the MDI for complex valued non-square matrices. This paper is an invited extension of the paper presented at the CDAM 2019 conference, Lietzén, Virta, Nordhausen, and Ilmonen (2019). In the conference paper, we extended the MDI to the case where only some of the sources, the signals, are of interest to us, and the index measures how well these signals are recovered. Furthermore, in order to cover applications such as biomedical image processing, where the observed signals can be complex valued, we worked under the assumption of complex valued signals. A real valued version of this extension was proposed already in Virta, Nordhausen, and Oja (2016). However, no theoretical justification for its properties was given in Virta et al. (2016). Similarly, a complex valued version of the regular MDI (where no distinction between the signal and the noise was made) was introduced in Lietzén, Nordhausen, and Ilmonen (2016). In this paper, we review the results presented in the aforementioned conference paper Lietzén et al. (2019). In particular, Lemma 2, Theorem 3 and Theorem 8, and the corresponding proofs, were already given in Lietzén et al. (2019). We then extend the theoretical results of Lietzén et al. (2019) by considering the limiting distribution of the non-square complex valued MDI. Moreover, to highlight the applicability and interpretation of the MDI, we present an interesting new image data example.

Linear complex valued BSS model
Let x be an observable C^p-valued random vector. Assume that

x = Ωz,    (1)

where Ω is a full-rank C^{p×p} matrix and the latent C^p-vector z = (z_1^⊤, z_0^⊤)^⊤ consists of two parts. In this linear complex valued BSS model, the C^d-vector z_1 contains the signals of interest. Furthermore, the C^{d_0}-vector z_0, d + d_0 = p, contains uninteresting noise.
Given Model (1), the objective is to find a C^{d×p} transformation matrix Γ, such that the transformed vector corresponds to the signal variables, Γx = z_1, up to some class of transformations, by using only the information contained in the observable x. Note that usually the transformation matrix Γ is not unique. For example, in independent component analysis (ICA), where the latent signals are assumed to be stochastically independent, the transformation matrix is unique only up to heterogeneous scaling, permutations and phase-shifts of the rows. Thus, in general, we have no guarantee that two separate IC estimation procedures estimate the same population quantities, which is something we need to consider when measuring performance. The minimum distance index discussed in the next section solves the issue by measuring how close the corresponding gain matrix G = ΓΩ is to the matrix I_{d,p} = [I_d 0] ∈ R^{d×p}, up to heterogeneous scaling, permutations and phase-shifts of the rows.
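To make the indeterminacy concrete, the following minimal numerical sketch (in Python, with a hypothetical randomly generated mixing matrix) shows that two unmixing matrices differing by a permutation, phase-shift and positive rescaling of the rows recover the signals equally well, up to exactly those transformations:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 2, 4
# hypothetical full-rank complex mixing matrix and latent vector (signals first)
Omega = rng.normal(size=(p, p)) + 1j * rng.normal(size=(p, p))
z = rng.normal(size=p) + 1j * rng.normal(size=p)
x = Omega @ z                                      # observed mixture

Gamma = np.linalg.inv(Omega)[:d, :]                # one exact unmixing of the signals
# an equally valid unmixing: permuted, rescaled and phase-shifted rows
P = np.array([[0.0, 1.0], [1.0, 0.0]])             # permutation
J = np.diag(np.exp(1j * np.array([0.3, -1.2])))    # phase-shifts
D = np.diag([2.0, 0.5])                            # strictly positive scales
Gamma2 = P @ J @ D @ Gamma

assert np.allclose(Gamma @ x, z[:d])               # recovers z_1 exactly
assert np.allclose(Gamma2 @ x, P @ J @ D @ z[:d])  # recovers z_1 up to PJD
```

Both transformations separate the signals perfectly, which is why a sensible performance index must treat them as equally good.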

Complex valued non-square MDI
In this section, we present the definition of the complex valued non-square MDI. We start by considering the concept of equivalent matrices.
Definition 1. Let ∼ be a relation on C^{d×p}, defined by

A ∼ B ⟺ A = (PJD)B, for some D ∈ D_d, J ∈ J_d and P ∈ P_d,

where P_d is the set of R^{d×d} permutation matrices, D_d is the set of R^{d×d} diagonal matrices with strictly positive real valued diagonal entries, and J_d is the set of C^{d×d} diagonal matrices with diagonal entries of the form exp(iθ_1), ..., exp(iθ_d), where i is the imaginary unit.

We use C_d to denote the set defined as

C_d = {PJD | P ∈ P_d, J ∈ J_d and D ∈ D_d},

and we measure distances using ‖·‖_F, the Frobenius norm. Note that the diagonal elements of matrices belonging to D_d are in the open interval (0, ∞), and thus, the optimization problem does not necessarily have a minimizer. This leads us to define the shortest distance between the equivalence class C_A and the matrix I_{d,p} using the infimum as follows,

D(A, I_{d,p}) = inf_{C ∈ C_d} ‖CA − I_{d,p}‖_F.    (3)

The squared distance [D(A, I_{d,p})]² can be converted into a more applicable form, see the following theorem.
Theorem 3. Let A ∈ C^{d×p} and let Ã be defined elementwise as Ã_jk = |A_jk|² / Σ_{l=1}^p |A_jl|², if A has at least one nonzero element in row j, and Ã_jk = 0 if A has only zeros in row j. Then, the squared distance [D(A, I_{d,p})]² defined in Eq. (3) coincides with

[D(A, I_{d,p})]² = d − max_{P ∈ P_d} Trace(P Ã),

where the trace of a non-square matrix is the sum of its main-diagonal elements.
Proof of Theorem 3. We find the greatest lower bound by allowing the matrix D to have zeros on the diagonal. We combine the optimization variables J and D by optimizing over a variable L ∈ L_d, where L_d is the set of all C^{d×d} diagonal matrices. Since the Frobenius norm is orthogonally invariant, the corresponding objective function f can be reformulated as follows,

f(P, L, I_{d,p}) = ‖PLA − I_{d,p}‖²_F = ‖LA − P^⊤ I_{d,p}‖²_F.

Next, write A = V + iW and L = Q + iR, where Q, R, V, W have real valued entries. Now, since L is a diagonal matrix, we obtain the following form for f(P, L, I_{d,p}):

f(P, L, I_{d,p}) = Σ_{j=1}^d Σ_{k=1}^p [(Q_jj V_jk − R_jj W_jk − b_jk)² + (Q_jj W_jk + R_jj V_jk)²], where b_jk := [P^⊤ I_{d,p}]_{jk}.

In the formula above, we have applied the constraints that the off-diagonal elements of Q and R are zero. Thus, the only remaining constraint is that P is a permutation matrix. We proceed to verify the Karush-Kuhn-Tucker (KKT) necessary conditions.
Assume that A has ℓ rows that have at least one nonzero element. The d − ℓ rows that contain only zeros give no contribution to the objective function, and we can, without loss of generality, permute A such that the first ℓ rows of A are the ones with at least one nonzero element.
The partial derivatives with respect to the first ℓ diagonal elements of Q and R are

∂f/∂Q_jj = 2 Σ_{k=1}^p [(Q_jj V_jk − R_jj W_jk − b_jk)V_jk + (Q_jj W_jk + R_jj V_jk)W_jk],
∂f/∂R_jj = 2 Σ_{k=1}^p [(Q_jj W_jk + R_jj V_jk)V_jk − (Q_jj V_jk − R_jj W_jk − b_jk)W_jk],

from which we can derive the solution candidates Q̂_jj and R̂_jj, which are optimal by the convexity of the corresponding optimization problem. Since the last d − ℓ rows of A do not contribute to the objective function, we set Q̂_jj = L̂_jj = 0 and P_jj = 1, when j > ℓ. Using the property V²_jk + W²_jk = |A_jk|², and since a permutation matrix has exactly one nonzero element 1 in every row and column, we can reformulate the objective function for L̂ = Q̂ + iR̂ as

f(P, L̂, I_{d,p}) = d − Σ_{j=1}^d Ã_{j,π(P,j)},

such that π : P_d × {1, ..., d} → {1, ..., d} : (P, j) ↦ c_j, where c_j is the row-index of the only nonzero element of the permutation matrix P in the column j. An equivalent optimization problem is then to find a permutation for the rows of Ã such that the sum of the main-diagonal elements is maximized, that is,

min_{P ∈ P_d} f(P, L̂, I_{d,p}) = d − max_{P ∈ P_d} Trace(P Ã).

As in Ilmonen et al. (2012), the trace maximization problem in Theorem 3 can be seen as a linear sum assignment problem (LSAP). It can be solved, e.g., by the Hungarian method, Papadimitriou and Steiglitz (1982). In practice, the LSAP can be solved using, e.g., the solve_LSAP function of the R package clue, Hornik (2005), in the statistical software R, R Core Team (2018).
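As a computational sketch, the trace maximization can also be solved with SciPy's `linear_sum_assignment` (an alternative to R's `solve_LSAP`); the row-normalization below follows the elementwise form of Ã derived above, and the final 1/d scaling matches the index defined in the next section:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mdi(G):
    """Complex non-square minimum distance index for a d x p gain matrix G.

    Implements MD^2 = 1 - (1/d) * max_P Trace(P G~), where
    G~_jk = |G_jk|^2 / sum_l |G_jl|^2 (rows of zeros stay zero).
    """
    G = np.asarray(G, dtype=complex)
    d = G.shape[0]
    abs2 = np.abs(G) ** 2
    rs = abs2.sum(axis=1, keepdims=True)
    Gt = np.divide(abs2, rs, out=np.zeros_like(abs2), where=rs > 0)
    # the trace maximization is an LSAP over the first d columns of G~
    rows, cols = linear_sum_assignment(Gt[:, :d], maximize=True)
    return np.sqrt(1.0 - Gt[rows, cols].sum() / d)

# sanity checks: perfect separation and complete failure, d = 2, p = 5
I_dp = np.hstack([np.eye(2), np.zeros((2, 3))])             # I_{d,p}
print(mdi(I_dp))                                            # -> 0.0
print(mdi(np.hstack([np.zeros((2, 2)), np.ones((2, 3))])))  # -> 1.0
```

The two printed values illustrate the extreme cases: a gain matrix equivalent to I_{d,p} yields index 0, while a gain matrix whose signal columns are all zero yields index 1.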
We next present the minimum distance index (MDI) for non-square complex valued matrices. Note that this is an extension of the cases presented in Ilmonen et al. (2010, 2012); Lietzén et al. (2016).
Definition 4. Let Ω be the mixing matrix in Model (1), let Γ be a corresponding unmixing estimate and let Ĝ = ΓΩ be C^{d×p}-valued. The minimum distance index (MDI) for the estimate Γ is given by

MD(Ĝ) = (1/√d) D(Ĝ, I_{d,p}).

Remark 5. Note that the non-square MDI is applicable beyond Model (1). It can, for example, be applied for assessing the performance when the mixing matrix Ω ∈ C^{p×q}, q ≤ p.
Note that, in order to apply the MDI, the mixing matrix Ω needs to be known. We next show that the MDI satisfies properties that enable interpretations and fair comparisons.

Theorem 6. Let G ∈ C^{d×p}. Then MD(G) ∈ [0, 1]. Furthermore, (i) MD(G) = 0 if and only if G ∼ I_{d,p}, (ii) MD(G) = 1 if and only if G = [0 B] for some B ∈ C^{d×(p−d)}, and (iii) a monotonicity property holds: increasing the moduli of the off-diagonal elements of G, while they remain less than one, cannot decrease MD(G).

Proof of Theorem 6. Let G̃ be defined elementwise as G̃_jk = |G_jk|² / Σ_{l=1}^p |G_jl|², if G has at least one nonzero element in row j, and G̃_jk = 0 if G has only zeros in row j. By Theorem 3, we have that MD²(A) = 1 − (1/d) max_P {Trace(P G̃)}, where the elements of G̃ are between zero and one. Hereby, the minimal value for MD²(A) is obtained when the main diagonal of P G̃ consists of ones, and consequently, Trace(P G̃) = d. In addition, the maximal value for MD²(A) is obtained when the first d columns of G̃ contain only zeros, and consequently, Trace(P G̃) = 0. Thus, it follows that MD(A) ∈ [0, 1]. Next, we proceed to verify properties (i)-(iii).
First, assume that G ∼ I_{d,p}. This gives us that G̃ is equal, up to a permutation, to I_{d,p}. The maximal trace is then achieved by the permutation that places the only nonzero elements 1 of the matrix G̃ on the main diagonal. Thus, max_P {Trace(P G̃)} = d, which implies MD(A) = 0. Next, assume that MD(A) = 0. This assumption gives us that max_P {Trace(P G̃)} = d. The corresponding trace can be d only when there exists an element G̃_jk in every row j such that k ≤ d and G̃_jk = 1. Hereby, there has to be exactly one nonzero element in each of the first d columns of G̃ and the nonzero elements have to be on different rows. Consequently, G is equivalent to I_{d,p}. Thus, property (i) holds.
For property (ii), first assume that G = [0 B]. Then, the main diagonal of P G̃ contains only zeros, regardless of the permutation and regardless of the matrix B. Thus, max_P {Trace(P G̃)} = 0, which gives us that MD(A) = 1. For the only if part, assume that MD(A) = 1. This assumption gives us that Trace(P G̃) = 0 for the optimal P, and consequently for every other P as well. Thus, G = [0 B], where B ∈ C^{d×(p−d)} is arbitrary. Hereby, property (ii) holds.
For the final property (iii), assume that 0 ≤ c_1 ≤ c_2 ≤ 1. The requirement that the absolute values of the off-diagonal elements of A are less than one ensures that the permutation matrix that maximizes the trace is the identity matrix. A direct comparison of the corresponding trace values then shows that the function f is increasing under the given conditions.
Remark 7. Note that Theorem 6 and its proof are very similar to (Lietzén et al. 2019, Theorem 6). However, (Lietzén et al. 2019, Theorem 6) contains an error that was corrected in Theorem 6 and its proof.

Two interpretations for MDI
Theorem 6 reveals that the endpoints of the MDI-range [0, 1] correspond to completely successful and completely unsuccessful signal separations, respectively. However, it is not entirely clear how much the separation result improves when the index decreases within the interval, say, from 0.2 to 0.1. To better interpret the MDI-values in practice, we next express the MDI through two better-known measures of estimation accuracy, the signal-to-noise ratio (SNR) and correlation.
Assume Model (1) and let, without loss of generality, the identifiability constraint Cov(z) = I_p hold. Let further Γ ∈ C^{d×p} be an estimated unmixing matrix for the d signals of interest, z_1 = (z_1, ..., z_d). The estimates of the signals are then ẑ_1 = (ẑ_1, ..., ẑ_d) = Ĝz, where Ĝ = ΓΩ ∈ C^{d×p} is the gain matrix. We define the SNR and the absolute correlation of the jth signal to be

SNR_j = Var(z_j) / Var(ẑ_j − z_j)    (2)

and

|Cor_j| = |Cov(ẑ_j, z_j)| / √(Var(ẑ_j) Var(z_j)).

The above form of the SNR is as defined in Gonzalez and Woods (2001). Both SNR_j and |Cor_j| measure the accuracy with which we estimate the jth signal. They have the respective ranges [0, ∞) and [0, 1], with the upper endpoints corresponding to a perfect estimation of the jth signal.
Our next result establishes a connection between the MDI, SNR_j and |Cor_j| under two constant contamination models for the gain matrix Ĝ: a homogeneous model, where the contamination is equal for each of the elements, and a heterogeneous model, where the contaminations are allowed to differ element-by-element but (for mathematical tractability) we require the diagonal elements of Ĝ to be uncontaminated. These contamination models mimic situations where there are estimation errors of different severities in the estimation of the unmixing matrix.
Theorem 8. Assume Model (1) with Cov(z) = I_p and let Ĝ = ΓΩ.
i) Under the homogeneous contamination, Ĝ = I_{d,p} + ε 1_{d×p}, where ε ∈ C and 1_{d×p} is the C^{d×p} matrix of ones, we have, for every j ∈ {1, ..., d},

MD²(Ĝ) = 1 − |Cor_j|² = (p − 1)|ε|² / (|1 + ε|² + (p − 1)|ε|²)  and  SNR_j = 1/(p|ε|²).

ii) Under the heterogeneous contamination, Ĝ = I_{d,p} + B, where B ∈ C^{d×p} has zero main diagonal, β_j := Σ_{k=1}^p |B_jk|², and the off-diagonal moduli are small enough that the identity permutation maximizes the trace in Theorem 3, we have

1 − MD²(Ĝ) = (1/d) Σ_{j=1}^d |Cor_j|² = (1/d) Σ_{j=1}^d SNR_j/(SNR_j + 1),

where SNR_j = 1/β_j and |Cor_j|² = 1/(1 + β_j).

Proof of Theorem 8. Under the homogeneous contamination, we have that MD²(Ĝ) = 1 − |1 + ε|²/(|1 + ε|² + (p − 1)|ε|²). Furthermore, ẑ_j − z_j = ε(z_1 + ... + z_p), and thus Var(ẑ_j − z_j) = p|ε|², which, by the form (2) of the SNR, gives SNR_j = 1/(p|ε|²). Similarly, the second-order statistics Var(z_j) = 1, Var(ẑ_j) = |1 + ε|² + (p − 1)|ε|² and Cov(ẑ_j, z_j) = 1 + ε imply that the squared absolute correlation for the jth signal has the value |Cor_j|² = |1 + ε|²/(|1 + ε|² + (p − 1)|ε|²), concluding the proof for the homogeneous contamination.

The claim is shown for the heterogeneous contamination in an exactly analogous manner, and we list only the values of some of the key quantities. The squared MDI is now MD²(Ĝ) = 1 − (1/d) Σ_{j=1}^d 1/(1 + β_j), and, using the form (2) for the signal-to-noise ratio, we obtain SNR_j = 1/β_j, which together prove the result for the SNR. For the correlation, we have the values Var(z_j) = 1, Var(ẑ_j) = 1 + β_j and Cov(ẑ_j, z_j) = 1, which together give |Cor_j|² = 1/(1 + β_j).

In Table 1, we demonstrate the results of Theorem 8 under the heterogeneous contamination (where the conversion formula does not depend on the dimensionality p), with the SNR-values expressed in the standard decibel scale through the transformation SNR ↦ 10 log₁₀(SNR).
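This conversion can be sketched numerically: assuming equal row contamination β = Σ_k |B_jk|² across the rows (a simplifying assumption made here for illustration), the heterogeneous model yields MD² = β/(1 + β), SNR = 1/β and |Cor|² = 1/(1 + β), and a few lines of Python tabulate the correspondence:

```python
import numpy as np

# Heterogeneous contamination with equal row contamination beta:
#   MD^2 = beta/(1 + beta),  SNR = 1/beta,  |Cor|^2 = 1/(1 + beta).
# (derived from the second-order statistics listed in the proof above)
for md in (0.05, 0.10, 0.20, 0.40):
    beta = md**2 / (1.0 - md**2)
    snr_db = 10.0 * np.log10(1.0 / beta)       # decibel scale, 10 log10(SNR)
    cor = np.sqrt(1.0 / (1.0 + beta))
    print(f"MDI = {md:.2f}   SNR = {snr_db:5.1f} dB   |Cor| = {cor:.4f}")
```

The MDI = 0.10 row of this tabulation gives an SNR of roughly 20 dB and an absolute correlation of roughly 0.995.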
Based on the table, the value MDI = 0.1 is already an indicator of a very satisfactory separation, corresponding to an average SNR of 20 dB and to an average absolute correlation of 0.995. Next, we visually demonstrate the MDI in the context of image separation. We consider four unmixed color images, presented in Figure 1. The example was conducted using the statistical software R, R Core Team (2018). We have chosen images with colors on the surface of the RGB cube, that is, at least one of the three values that define the color of a single pixel has either the maximal or the minimal value. We can then find a transformation between the RGB cube surface and the complex plane. The transformation is not defined at a single point and is bijective everywhere else. The images in this example are chosen such that none of their pixels are on the non-bijective point. We have then homogeneously contaminated the images and calculated the corresponding MD index values for them. As an example, we present the contaminated second image, with the corresponding MDI values, in Figure 2. The contamination is conducted as in Theorem 8 part i), which ensures that the optimal permutation is the identity matrix, and thus the order of the images remains unchanged.
As the contamination is homogeneous, the omitted images illustrate similar results. Note that the actual color in Figure 2 does not play a role in calculating the MDI, as the color corresponds to the phase-shift and the MDI is invariant with respect to scale, complex phase and the order of the components.

Asymptotic behavior
In this section, we provide the limiting distribution of the minimum distance index under general settings. In addition, we provide an example of the asymptotic properties when we have √n-consistency and asymptotic normality.
For an n-indexed sequence B(n) of C d×p -matrices, we use B(n) = O p (1) to denote that the univariate sequences that correspond to the elements of B(n) are uniformly tight, i.e., bounded in probability.Additionally, we use B(n) = o p (1) to denote that a sequence converges in probability to a zero matrix.
Under general settings, finding the asymptotic distribution of the MDI can be complicated.
In the following theorem, we present a direct asymptotic link between a sequence Â(n) and the squared distance [D(Â(n), I_{d,p})]². By applying the theorem, one can usually significantly simplify the problem of finding the asymptotic distribution of the corresponding MDI.
Theorem 9. Let Â(n) be an n-indexed sequence of C^{d×p}-matrices that satisfies γ_n(Â(n) − I_{d,p}) = O_p(1), where (γ_n)_{n∈N} is a real-valued increasing sequence that tends to infinity as n → ∞. Then, as n → ∞,

γ_n² [D(Â(n), I_{d,p})]² − γ_n² ‖off(Â(n))‖²_F = o_p(1),

where D(Â(n), I_{d,p}) is defined in Eq. (3) and the function off(·) sets the main-diagonal elements of its argument equal to zero.
Proof of Theorem 9. Note that the assumption γ_n(Â(n) − I_{d,p}) = O_p(1) implies that Â(n) − I_{d,p} = o_p(1). In addition, since the number of possible permutation matrices is finite, we have that γ_n(P̂(n) − I_d) = o_p(1). For additional details, see the beginning of the proof of (Ilmonen et al. 2012, Theorem 4.2), and note that similar arguments can be applied here.
We can then apply the following factorization,

[D(Â(n), I_{d,p})]² = ‖P̂(n) L̂(n) Â(n) − I_{d,p}‖²_F.    (5)

In the following, we denote â_jk := [Â(n)]_{j,k}, Ŝ_j := Σ_{k=1}^p |â_jk|² and p̂_jk := [P̂(n)]_{j,k}. Furthermore, we denote the complex conjugate of a as a*. Note that Ŝ_j = 1 + o_p(1), and that γ_n|â_jk|² = γ_n(â_jk â*_jk) = o_p(1), when j ≠ k. Recall the form of the entries of the matrix L̂(n) from Theorem 3. As P̂(n) tends in probability to the identity, the main-diagonal element (j, j) of Eq. (5) can then be expressed as

[P̂(n) L̂(n) Â(n) − I_{d,p}]_{jj} = (â_jj â*_jj)/Ŝ_j − 1 + o_p(γ_n^{-1}) = −(1/Ŝ_j) Σ_{k≠j} |â_jk|² + o_p(γ_n^{-1}).

The above representation uses the formulation for L̂(n) that assumes at least one nonzero entry in every row of Â(n). As Â(n) converges in probability to I_{d,p}, and since the theorem is an asymptotic result, it suffices to only consider the case of every row containing at least one nonzero element.
Thus, by using the continuous mapping theorem and the properties of the Frobenius norm, the claim follows. Hereby, the asymptotic distribution of the squared MDI is obtained by scaling both sides of the expression in Theorem 9 with 1/d. Remarkably, Theorem 9 gives that the asymptotic distribution of the MDI depends only on the off-diagonal elements of the sequence Â(n).
We next present the widely considered case, where γ_n = √n and Â(n) is asymptotically normal.
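A Monte Carlo sketch of this setting (hypothetical complex Gaussian perturbations, γ_n = √n) illustrates Theorem 9: for Â(n) = I_{d,p} + Z/√n, the exactly computed γ_n²[D(Â(n), I_{d,p})]², evaluated via the trace form of Theorem 3, agrees closely with γ_n²‖off(Â(n))‖²_F when n is large:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sq_dist(A):
    """[D(A, I_{d,p})]^2 = d - max_P Trace(P A~), the trace form of Theorem 3."""
    d = A.shape[0]
    abs2 = np.abs(A) ** 2
    rs = abs2.sum(axis=1, keepdims=True)
    At = np.divide(abs2, rs, out=np.zeros_like(abs2), where=rs > 0)
    rows, cols = linear_sum_assignment(At[:, :d], maximize=True)
    return d - At[rows, cols].sum()

rng = np.random.default_rng(1)
d, p, n = 3, 5, 10**8
I_dp = np.hstack([np.eye(d), np.zeros((d, p - d))])
Z = rng.normal(size=(d, p)) + 1j * rng.normal(size=(d, p))
A_n = I_dp + Z / np.sqrt(n)                    # gamma_n = sqrt(n)

off = A_n.copy()
off[np.arange(d), np.arange(d)] = 0.0          # off(.) zeroes the main diagonal
lhs = n * sq_dist(A_n)                         # gamma_n^2 [D(A_n, I_{d,p})]^2
rhs = n * np.linalg.norm(off, "fro") ** 2      # gamma_n^2 ||off(A_n)||_F^2
print(lhs, rhs)                                # nearly identical for large n
```

The remaining discrepancy between the two printed values is of the claimed o_p(1) order and shrinks further as n grows.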

Lemma 2. The relation ∼ is an equivalence relation on C^{d×p}.

We use the equivalence relation ∼ to partition C^{d×p} into equivalence classes and denote the set of matrices that are equivalent to the matrix A as C_A = {B ∈ C^{d×p} | A ∼ B}. We next consider the shortest squared distance between the equivalence class C_A and the matrix I_{d,p} = [I_d 0] ∈ R^{d×p}. The corresponding optimization problem is,

minimize ‖PJDA − I_{d,p}‖²_F  s.t.  P ∈ P_d, J ∈ J_d and D ∈ D_d.

Here, the infimum is attained over the closure of C_d, where D_d is replaced by the set of R^{d×d} diagonal matrices with non-negative diagonal entries. We can then write

D(Â(n), I_{d,p}) = inf_{C ∈ C_d} ‖C Â(n) − I_{d,p}‖_F = ‖Ĉ(n) Â(n) − I_{d,p}‖_F.

Similarly as in the proof of Theorem 3, we can split Ĉ(n) into a permutation matrix part P̂(n) ∈ P_d and a diagonal matrix part L̂(n) ∈ L_d, such that Ĉ(n) = P̂(n) L̂(n). Since Â(n) converges in probability to I_{d,p}, we obtain that Ĉ(n) − I_d = o_p(1), see (Arcones 1998, Theorem 1). This further implies that P̂(n) − I_d = o_p(1) and L̂(n) − I_d = o_p(1).

Table 1: The average SNRs and absolute correlations for particular MDI-values.