A Metropolis-Hastings Sampling of Subtrees in Graphs

This article presents two methods to sample uniform subtrees from graphs using Metropolis-Hastings algorithms. One method is an independent Metropolis-Hastings sampler and the other is a type of add-and-delete MCMC. In addition to the theoretical contributions, we present simulation studies which confirm the theoretical convergence results for our methods by monitoring the convergence of our Markov chains to the equilibrium distribution.


Definitions
A graph is a heavily used data structure in the world of algorithms, with numerous applications in computer science such as communication networks, data organization, computational devices and the flow of computation. Graphs have also proven particularly useful in linguistics, in addition to their use in chemistry, physics, sociology and biology. Let G(V, M) be a graph consisting of a finite set of vertices V and a finite set M of ordered pairs of the form (u, v) called edges. The graph is directed if the pair (u, v) is not the same as (v, u), where (u, v) indicates that there is an edge from vertex u to vertex v, while it is undirected if both pairs represent the same edge. A graph is called regular if each vertex has the same number of neighbours, i.e. every vertex has the same degree. A special case of regular graphs is the complete graph, where every vertex is connected to all other vertices. A graph is connected if for every pair (x, y) of distinct vertices there is a path from x to y. A graph without any cycle is a forest; a tree T(V_T, M_T) is a connected forest, where the order of a tree is its number of vertices |V_T| and its size is its number of edges |M_T|. For a rigorous presentation of graph properties we recommend (Bollobás 1979) and (Bollobás 2001). Here we consider the uniform distribution over trees T(V_T, M_T) verifying card(V_T) = k + 1.
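The characterization above can be checked mechanically: an undirected graph on |V| vertices is a tree if and only if it is connected and has exactly |V| − 1 edges. A minimal sketch (the function name and data representation are our own):

```python
from collections import deque

def is_tree(vertices, edges):
    """A graph is a tree iff it is connected and has |V| - 1 edges."""
    if len(edges) != len(vertices) - 1:
        return False
    # adjacency list of the undirected graph
    adj = {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    # breadth-first search from an arbitrary vertex to test connectivity
    seen = {next(iter(vertices))}
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(vertices)

# a path on 4 vertices is a tree; closing it into a cycle is not
assert is_tree({0, 1, 2, 3}, [(0, 1), (1, 2), (2, 3)])
assert not is_tree({0, 1, 2, 3}, [(0, 1), (1, 2), (2, 3), (3, 0)])
```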

Literature review
Many applications generate very large graphs with thousands or millions of vertices. These large graphs are challenging to study because they require running expensive algorithms such as simulations, and because it is hard to get an impression of the graph topology from visualization. Unfortunately, such computations on large graphs usually take time at least polynomial in the number of vertices. One way to reduce the runtime is to reduce the graph size by sampling from the large graph a structure which approximates the original graph well. Graph mining is a branch of data mining which is used for mining graph structures (Rehman, Khan, and Fong 2012). It has gained much attention during the last years and finds applications in many domains such as social and computer networks, bioinformatics and chemistry. In the literature, various graph mining approaches have been proposed for classification, clustering and sampling. In general, the graph sampling methods, which are our concern, are divided into three categories (Ahmed, Neville, and Kompella 2011): node sampling, edge sampling and topology-based sampling. Node sampling simply creates a representative structure by sampling the vertices uniformly, where the edges between the sampled vertices in the large graph are kept as edges in the sampled structure. A related method is random node-edge sampling (Hu and Lau 2013), where vertices are sampled uniformly and edges incident to these vertices are also sampled uniformly into the sample graph. Additionally, some node sampling methods also consider the neighbours of the sampled vertices, like the random node-neighbour sampling method (Leskovec and Faloutsos 2006), where all the edges linked to the sampled vertices in the graph are sampled into the required structure. Moreover, many of those methods were developed to use the graph topology information by integrating with topology-based sampling methods like random walk sampling (Yoon, Lee, Yook, and Kim 2007) and the Metropolis algorithm (Hübler, Kriegel, Borgwardt, and Ghahramani 2008), which replaces some sampled vertices with other vertices to obtain a sample structure with properties consistent with the graph. Similarly, edge sampling builds a subgraph by sampling edges randomly. For instance, in random edge sampling (Ebbes, Huang, Rangaswamy, and Thadakamalla 2008), the subgraph is built from edges sampled randomly and uniformly. A modified method is induced edge sampling (Ahmed, Neville, and Kompella 2012) with its two extensions, totally induced edge sampling and partially induced edge sampling. The first applies random edge sampling, obtains the vertices adjacent to the sampled edges, and then adds all edges attached to those vertices to the sampled graph. In contrast, partially induced edge sampling applies edge sampling where edges are sampled according to a probability. Various approaches that leverage tree mining algorithms have been developed as well, since trees are among the most well studied structures in graphs. Over the past decade, trees have found a surprising number of applications on the Internet and in computer science; a prominent example is XML, a markup language designed to store and transport data. XML data is very popular because of its tree structure, and as a result it is necessary to develop methods that can treat and extract patterns from this type of data, for example tracking down the common trees existing among a set of such documents. Additionally, trees have been heavily researched in biology and bioinformatics, where they are widely used to represent various biological structures like glycans, RNAs and phylogenies. For example, Shapiro and Zhang (1990) studied the function of RNA, where RNA structures were collected in trees in order to compare any newly sequenced RNA and assess the similarities in the topological patterns. Takigawa, Hashimoto, Shiga, Kanehisa, and Mamitsuka (2010) represented glycans as directed trees in which nodes are monosaccharides and edges are linkages, and proposed an efficient method for mining frequent and statistically significant subtrees from glycan trees. For a good survey of tree mining algorithms in biology, see (Parthasarathy, Tatikonda, and Ucar 2010). The authors, in a previous work (Hancock, Wicker, Takigawa, and Mamitsuka 2012), sampled pathways of genes with significantly coordinated expression profiles. They aimed at identifying the important metabolites driving the function of the network by comparing these metabolites to the metabolites in the sampled pathways. As a consequence, they were motivated to search for a bigger structure that can better identify these metabolites. Although there are many known methods in the literature to sample from a graph, most are designed either to sample subgraphs that are representative of the original graph according to the graph properties, as in the work of (Hübler et al. 2008), or to sample frequent structural patterns, as in (Yan and Han 2002); neither is our concern in this work. In addition, most of these techniques are generally not efficient for sampling subtrees specifically, because subtrees represent only a small proportion of the sampled structures. Our contribution in this work is to present two techniques to sample trees according to a distribution from a graph where the vertices are labelled; i.e. the tree structure (a.k.a. pattern) is not taken into consideration in our method, nor are the graph properties.

Sampling methods
Sampling uniformly subtrees of a given size from an initial graph is not a trivial problem. A way to solve this problem is to use Markov chain Monte Carlo (MCMC) sampling (for general references on Markov chains and Markov chain Monte Carlo see (Kemeny and Snell 1983), (Robert and Casella 2009) and (Brooks, Gelman, Jones, and Meng 2011)). Here, for this purpose, we present a Metropolis-Hastings (MH) dynamics. The MH algorithm is an iterative procedure that simulates a Markov chain. If the simulated Markov chain is irreducible and aperiodic, then, as the configuration space of the chain is finite, the algorithm is convergent. More precisely, this means that the outputs of the algorithm are asymptotically distributed according to the unique invariant distribution of the simulated Markov chain (Häggström 2002). The principle of the MH algorithm is the following: let π be a distribution of interest and consider x_t the current state of the chain. A new candidate x* is proposed according to a proposal distribution q(x_t, x*). The proposed candidate is accepted as the new state of the chain, i.e. x_{t+1} = x*, with probability

α(x_t, x*) = min(1, [π(x*) q(x*, x_t)] / [π(x_t) q(x_t, x*)]).

We present two methods for sampling uniformly trees with k edges from an undirected and unweighted graph. The first method samples uniform trees according to an independent Metropolis-Hastings sampler, whereas the second method does it according to a non-independent Metropolis-Hastings algorithm.
Our Markov chain (X_t) is produced through this transition kernel. For the independent method, we use a special case of the Metropolis-Hastings algorithm where the candidate x* is independent of the present state of the chain x_t, so q(x_t, x*) = q(x*) and the acceptance probability reduces to α(x_t, x*) = min(1, [π(x*) q(x_t)] / [π(x_t) q(x*)]). In what follows, each method is first detailed and then its convergence speed is studied.
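The independent acceptance step can be sketched generically. The following Python fragment is a minimal illustration of the textbook independent MH update, not of the tree-specific samplers detailed below; all names are our own, and it works in log space for numerical safety:

```python
import math
import random

def independent_mh_step(x_t, propose, log_p, log_q):
    """One step of an independent Metropolis-Hastings sampler: the
    candidate x* is drawn from q regardless of the current state x_t
    and accepted with probability min(1, p(x*) q(x_t) / (p(x_t) q(x*))).
    `log_p` is the (possibly unnormalized) log target, `log_q` the
    log proposal density."""
    x_star = propose()
    log_alpha = (log_p(x_star) + log_q(x_t)) - (log_p(x_t) + log_q(x_star))
    if math.log(random.random()) < min(0.0, log_alpha):
        return x_star   # accept the candidate
    return x_t          # reject: the chain stays at the current state

# toy check: target the uniform law on {0,...,4} through a biased proposal
random.seed(0)
weights = [1, 2, 3, 4, 5]
propose = lambda: random.choices(range(5), weights=weights)[0]
log_q = lambda i: math.log(weights[i] / sum(weights))
log_p = lambda i: 0.0   # uniform target (unnormalized)

counts = [0] * 5
x = 0
for _ in range(20000):
    x = independent_mh_step(x, propose, log_p, log_q)
    counts[x] += 1
freqs = [c / 20000 for c in counts]
assert all(0.15 < f < 0.25 for f in freqs)  # roughly uniform
```

The toy run shows the correction at work: the proposal favours state 4 five-to-one over state 0, yet the accepted chain spends close to a fifth of its time in each state.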

First method: independent uniform trees
In the following we present the proposal distribution for the independent sampler. This method generates the candidate tree x* in the following way. A vertex v_{i_1} is selected randomly from the original graph and receives a weight w_{i_1} = k; this weight is then distributed among its neighbours in the graph so that each receives a weight, and the vertices with weight greater than 0 are selected as neighbours of the first vertex in the generated subtree. The weight k represents the number of vertices that remain to be connected in the whole subtree after choosing a given vertex; this weight is distributed over the selected neighbours according to a uniform multinomial distribution with equal probabilities. The process is conducted iteratively until no weight is left anywhere. Algorithm 1 presents the detailed steps to generate a tree T; it is important to note that A is an ordered set where the last entered vertices are the smallest in the ordering sense.
Algorithm 1 Generate a random tree T
Require: a graph G(V, M) and a size k
Select a vertex v uniformly from V, set w(v) ← k and put v in the ordered set A
while A is not empty do
  take the next vertex v from A and let n(v) be the neighbours of v in the graph not already selected in the tree
  if n(v) = ∅ and w(v) > 1 then the algorithm stops and is relaunched from the beginning
  else the weight w(v) is distributed among n(v) according to a uniform multinomial; each neighbour receiving a positive weight is added to the tree and, if it still carries weight to distribute, to A
end while

To compute p(T), the probability of generating a subtree T, start from any vertex v of T and compute the probability of generating T starting from it. Algorithm 2 computes the probability of generating a subtree as detailed in its description; n_T(v) denotes the neighbours of v in the subtree T. Also, p(x*) must account for all the possible ways of generating x*.
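The generation step can be sketched in code. The following is our own reconstruction of the proposal from the description above (names, the restart cap and the exact failure convention are ours), not a verbatim transcription of Algorithm 1:

```python
import random

def sample_proposal_tree(adj, k, max_restarts=10000):
    """Sketch of the Algorithm 1 proposal. `adj` maps each vertex of G
    to the list of its neighbours; returns (vertices, edges) of a
    random subtree with k edges, relaunching from scratch whenever a
    vertex has weight left but no free neighbour."""
    for _ in range(max_restarts):
        root = random.choice(list(adj))
        tree_v, tree_e = {root}, []
        active = [(root, k)]          # vertices still carrying weight
        failed = False
        while active:
            v, w = active.pop(0)
            free = [u for u in adj[v] if u not in tree_v]
            if not free:              # no neighbour left: relaunch
                failed = True
                break
            # uniform multinomial: drop the w weight units one by one
            # on the free neighbours with equal probabilities
            counts = {}
            for _ in range(w):
                u = random.choice(free)
                counts[u] = counts.get(u, 0) + 1
            for u, c in counts.items():
                tree_v.add(u)                  # u joins the tree...
                tree_e.append((v, u))          # ...through the edge (v, u)
                if c > 1:
                    active.append((u, c - 1))  # c - 1 vertices still to add
        if not failed:
            return tree_v, tree_e
    raise RuntimeError("proposal failed repeatedly")

# on a 10-cycle every proposal with k = 3 is a path with 3 edges
random.seed(1)
cycle = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
verts, edges = sample_proposal_tree(cycle, 3)
assert len(verts) == 4 and len(edges) == 3
```

Since every unit of weight placed on a fresh neighbour adds exactly one vertex and one edge, a successful run always returns a subtree of size k.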
Algorithm 2 Compute the probability p(T) of generating a subtree. The weights used to compute this probability are computed as shown in Algorithm 3. The chain is clearly irreducible, as each subtree has a positive probability of being sampled at each step. This also implies aperiodicity, since it is always possible to stay at the same state. Bollobás (2001) showed that, for r ≥ 2 and n ≥ 3, if Y_i is the number of cycles of length at most i in an r-regular graph generated by the Erdős-Rényi random graph model, then the Y_i are asymptotically independent Poisson random variables with mean λ_i = (r − 1)^i / (2i). We assume henceforth that these random variables follow a Poisson distribution; indeed, in practical applications the graph size is often large. Under these assumptions, the following result gives a bound on the speed of convergence of the independent Markov chain sampler.
Theorem 1. Assume that Y_{k+1} is a Poisson distributed random variable representing the number of cycles of length at most k + 1. Then for a random r-regular graph in the Erdős-Rényi random graph model, where the vertex degree r ≥ 2 and each vertex is contained in a cycle of length at least k + 1, we have for any starting state x, with probability larger than 1 − α: where M_x^m is the m-th updated distribution and π is the target distribution.
Proof. For the convergence, we can use the bound on the total variation distance ||M_x^m − π|| ≤ (1 − u(1))^m, with u(1) = min_x p(x)/π(x) (Liu 1996). As π(x) is the inverse of the number of subtrees, we need a lower bound on the number of subtrees of a given size k in a graph.
The number of subtrees can be bounded by noting that a tree is obtained whenever a cycle of length k + 1 is deprived of an edge; in this way we obtain k + 1 different subtrees of size k from each such cycle. Moreover, if there are c such cycles we are able to generate (k + 1)c subtrees. This is how we lower bound the number of subtrees, although the true number of subtrees in a graph is of course larger than the number generated in this way. Let Y_{k+1} denote the number of cycles of length at most k + 1 and let F denote the corresponding event; then we can conclude, with probability larger than 1 − α: Let S denote the set of all possible subtrees of size k that can be obtained from our graph. A given tree of size k can be generated by the proposal distribution in k + 1 different ways, one for each starting vertex of the tree. Additionally, each way involves l steps (l ∈ {1, . . ., k}) representing the number of multinomial steps: at least 1 if all weights are distributed at once in a star-like manner, and at most k if the weight is given every time to a single vertex, producing a path of length k + 1. As a consequence, each subtree has a proposal probability of the given form, and the probability p(T) of generating a tree is lower bounded accordingly. Gathering the results of equations 3 and 4, we obtain a bound which clearly goes to 0 when the graph is large, i.e. when the number of vertices |V| = n is large. We assumed that each vertex is contained in a cycle of length k + 1 to guarantee that the proposed subtree x* can be obtained from any of its vertices. In general, the success rate s of Algorithm 1 will be strictly lower than 1. It is worth mentioning that s is not the Metropolis-Hastings acceptance rate; we stress this to avoid misleading interpretations. More precisely, Algorithm 1 fails when the selected vertex has fewer available neighbours than the weight to distribute among them, which causes Algorithm 1 to stop and start again from the beginning. The success rate can be estimated as s ≈ (number of acceptances)/(number of steps), which is normally close to 1 unless the border is large, as on expander graphs for example.
Theorem 2. Let Y_{k+1} be the random variable defined in Theorem 1 and let s be the algorithm success rate. Then, for a random r-regular graph with r ≥ 2, we have for any starting state x, with probability larger than 1 − α: where M_x^m is the m-th updated distribution and π is the target distribution.
Proof. Again, we use the total variation distance bound ||M_x^m − π|| ≤ (1 − u(1))^m, with u(1) = min_x p(x)/π(x) (Liu 1996). If we drop the constraint of Theorem 1 stating that each vertex is contained in a cycle of length at least k + 1, then at any step of the process the weight given to the chosen vertex may exceed its number of neighbours, which causes Algorithm 1 to stop and start again from another vertex. The probability of generating a subtree then needs to be adjusted; for this, we first compute it conditionally on the success of Algorithm 1. It is bounded up to a normalizing constant c, itself upper bounded by 1, since the distribution of weights is constrained by the success of Algorithm 1. Next, we can again use bound 3 on the number of cycles as well as the success rate s to obtain the result. The convergence speed of this chain is O(s^{-1} n r^{k²/2}).
Corollary 1. For the random r-regular graph of Theorem 1, with probability larger than 1 − α, the method restricted to path sampling verifies: Proof. It suffices to replace the factor k + 1 by 2 in equation 4 of the proof of Theorem 1, as a path has only two possible starting points.

Second method: crawling uniform trees
As a first step, we initialize the algorithm by selecting a random vertex and adding k edges greedily to build a tree of order k + 1. The crawling method then modifies the current subtree by shifting one of its edges at random, i.e. an edge is deleted randomly and another is randomly added. The Metropolis-Hastings acceptance rate is: where n_{G′}(x_t) denotes the set of neighbours of the tree x_t in the graph G′(V′, E′) representing the Markov chain state space. Let n_{x_t}(v_i) denote the set of neighbour vertices inside the tree x_t for the vertices belonging to the tree, let l_{x_t} be the set of leaf vertices of the tree, i.e. vertices connected to only one neighbour vertex in x_t, and finally let n_G(v_i) be the set of neighbours of the tree leaves inside the graph G(V, E) which do not belong to the tree x_t. Algorithm 4 presents the steps to generate a tree x* from a randomly initialized tree x_t.
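The edge-shift move can be sketched as follows. This is our own simplified reading of the proposal (delete the edge of a random leaf, then attach a random outside neighbour of the remaining tree); the Metropolis-Hastings acceptance step with the q ratio is omitted, and all names are ours:

```python
import random

def crawl_move(adj, tree_v, tree_e):
    """One candidate move of the crawling sampler: drop a leaf edge and
    attach a graph neighbour of the remaining tree, so the candidate is
    again a tree with the same number of edges."""
    # degree of each vertex inside the current tree
    deg = {v: 0 for v in tree_v}
    for u, v in tree_e:
        deg[u] += 1
        deg[v] += 1
    leaf = random.choice([v for v in tree_v if deg[v] == 1])
    kept_v = tree_v - {leaf}                      # removing a leaf keeps a tree
    kept_e = [e for e in tree_e if leaf not in e]
    # graph neighbours of the kept tree lying outside it (the deleted
    # leaf itself may be picked again, leaving the tree unchanged)
    boundary = [(u, w) for u in kept_v for w in adj[u] if w not in kept_v]
    u, w = random.choice(boundary)
    return kept_v | {w}, kept_e + [(u, w)]

# quick check on a 10-cycle, starting from the path 0 - 1 - 2 - 3
random.seed(2)
cycle = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
tv, te = {0, 1, 2, 3}, [(0, 1), (1, 2), (2, 3)]
for _ in range(50):
    tv, te = crawl_move(cycle, tv, te)
    assert len(tv) == 4 and len(te) == 3
```

Every move preserves the invariant of k edges and k + 1 vertices, which is what lets the chain live on the state space G′ of subtrees of fixed size.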
Algorithm 4 Generate a tree x_{t+1}
Require: a tree x_t with V_T = {v_1, v_2, . . ., v_{k+1}} and E_T = {e_1, e_2, . . ., e_k}
Propose x_{t+1} ← x_t ∪ {v_j} with probability α ← min(1, q(x_{t+1}, x_t)/q(x_t, x_{t+1}))
Keep x_{t+1} ← x_t ∪ {v_i} with probability 1 − α
end while

This chain is aperiodic since q(x_t, x*) > 0 for all states x*, which allows it to stay at the same state with positive probability. It is also irreducible, as will be seen in the proof of Theorem 3. In order to bound the mixing time of the Markov chain, we will use the second smallest eigenvalue λ_1 after making the chain lazy, where a lazy chain stays in the current state at each step with probability at least 1/2. The factor 1/2 is a sufficient condition for the transition probability matrix to have only positive eigenvalues. First, let us recall that the Cheeger constant of a graph G(V, E) is defined as

h = min_S |E(S, S̄)| / vol(S),

where vol(S) stands for the sum of the degrees of the vertices in S and |E(S, S̄)| for the number of edges between S and S̄. Henceforth S will denote the subset of V realizing this minimum, and without loss of generality we consider that S = min{S, S̄}, so that h = |E(S, S̄)| / vol(S).
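On small graphs the Cheeger constant can be evaluated by brute force, which is useful for checking intuition about bottlenecks; a minimal sketch (exponential in |V|, for illustration only, with the vol(S) ≤ vol(V)/2 convention):

```python
from itertools import combinations

def cheeger_constant(adj):
    """Brute-force h = min_S |E(S, S~)| / vol(S) over non-empty vertex
    sets S with vol(S) <= vol(V)/2. Exponential in |V|."""
    V = list(adj)
    deg = {v: len(adj[v]) for v in V}
    vol_total = sum(deg.values())
    best = float("inf")
    for r in range(1, len(V)):
        for subset in combinations(V, r):
            S = set(subset)
            vol_S = sum(deg[v] for v in S)
            if vol_S > vol_total / 2:
                continue
            # edges crossing the cut from S to its complement
            cut = sum(1 for v in S for w in adj[v] if w not in S)
            best = min(best, cut / vol_S)
    return best

# on a 4-cycle the optimum is two adjacent vertices: cut 2, volume 4
cycle4 = {i: [(i - 1) % 4, (i + 1) % 4] for i in range(4)}
assert cheeger_constant(cycle4) == 0.5
```

A small h signals a narrow bottleneck, which is exactly the regime where the crawling chain mixes slowly in the experiments below.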
Lemma 1. Let h be the Cheeger constant of the graph G(V, E) and vol(V) the sum of the degrees of the vertices in V. Then there exists a set of paths Γ such that the maximum number b of paths containing any edge e satisfies b ≤ vol(V)/h + 1. Proof. This is proven by contradiction. First let us recall that a path is a sequence of edges which connect a sequence of distinct vertices. For convenience, c will stand for vol(V)/h. Suppose that, whatever the set of paths Γ, there is always an edge e with at least c + 2 paths containing it. Then e is an edge between two vertices a and b. Suppose that there is a path p between a and b whose edges support fewer than c + 1 paths; then e could have fewer than c + 2 paths crossing it, by replacing a path x → a → b → y which crosses e with the new path x → a → p → b → y. So on any path between a and b one meets at some point an edge supporting at least c + 1 paths. The consequence is that there is a graph cut between a and b containing only edges with at least c + 1 paths. This cut creates two subgraphs, A and B, containing respectively a and b. We show now that this leads to a contradiction. Indeed, by the definition of S, the mean number of paths per edge of E(A, B) is at most c + 1. By hypothesis, each edge of E(A, B) belongs to at least c + 1 paths, and in particular e supports c + 2 paths. It is then immediate that the mean number of paths per edge of E(A, B) is strictly greater than c + 1, which is impossible. This concludes the proof.
Theorem 3. Let a be the maximum number of subtrees associated to a vertex, d_max the graph maximum degree, D the graph diameter and λ_1 the second smallest eigenvalue on the graph. Then for any starting state x, where M_x^m is the m-th updated distribution and π is the target distribution, we have: Proof. Again, let G′(V′, E′) be the Markov chain state space, or simply the graph of subtrees with k edges. According to Sinclair (1992), the Markov chain has the following bound on its second smallest eigenvalue: where P(e_1, e_2) is the transition matrix and b′ stands for the maximum number of paths crossing an edge e in the graph G′(V′, E′). At this point, we have to make P(e_1, e_2) more explicit; indeed Q(e_1, e_2) is the Markov chain related to the Metropolis-Hastings algorithm being used, that is:

P(e_1, e_2) = (1/2) q(e_1, e_2) [q(e_2, e_1)/q(e_1, e_2) ∧ 1] = (1/2) [q(e_1, e_2) ∧ q(e_2, e_1)],

where q(e_1, e_2) is the proposal law and the factor 1/2 is due to the laziness. Thus, let us first bound the value of b′. Here we need the conductance information on the graph G(V, E) given by b. The set of paths Γ′ on G′(V′, E′) is derived from the set of paths Γ on G(V, E) in the following way. To each vertex v ∈ V is associated the set of subtrees containing it, denoted by T(v). So if we want a path between two subtrees s and t, we look for v_s ∈ V and v_t ∈ V such that s ∈ T(v_s) and t ∈ T(v_t). Then, if there is a path between v_s and v_t, there is an associated path between s and t. This is possible since v_i ∼ v_{i+1}, so to obtain t(v_{i+1}) connected to t(v_i) it is enough to remove the edge of t(v_i) farthest from t(v_{i+1}) and replace it by the edge v_i v_{i+1}. Incidentally, this shows that D′ ≤ D + k.
The additional term is needed because the first subtree reaching v_t may differ from t, so that additional moves are needed, up to a maximum of k. Now, if we denote by a the maximum number of subtrees associated to a vertex, we can bound b′ by noting that an edge e of G′(V′, E′) will be crossed as many times as there are paths in Γ′ associated to paths in Γ crossing an edge (v_i, v_j), where v_i is a vertex of s and v_j a vertex of t. That makes (k + 1)² pairs, multiplied by the number of times these pairs can be used, so: Then the bound on b of Lemma 1 can be injected, and we get: Besides, by the Cheeger inequality: Consequently, injecting the bound D′ ≤ D + k and equation 10 into 8, we obtain the result.

Experimental evaluation
In this section we examine the theoretical results using simulations on three different types of graphs: an Erdős-Rényi graph, a regular graph and a barbell graph variant. The barbell graph consists of two connected grids, each containing a finite number of adjacent connected vertices, where the connection between any two adjacent vertices is an undirected edge. We were careful to generate grids with different structures within the same graph: any vertex in the first grid can be connected to another vertex only in a horizontal or vertical direction, so the maximum number of edges touching a vertex is 4, whereas any vertex in the second grid can touch a maximum of 8 neighbour vertices, since it can also connect diagonally. This choice is justified because we want to observe the behaviour of the sampling methods on different structures.
During the simulations, we sampled subtrees of size 5, 10 and 20 from graphs with between 1000 and 1 million vertices. To guarantee the efficiency of these simulations, a burn-in period of 1000 iterations was applied, and we then kept only one sample every 1000 iterations.
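The burn-in and thinning protocol amounts to simple bookkeeping around any transition kernel; a minimal generic sketch (the `step` argument is a placeholder for either sampler, and the function name is ours):

```python
def run_chain(step, x0, n_samples, burn_in=1000, thin=1000):
    """Discard the first `burn_in` states of the chain, then keep one
    state every `thin` iterations, as in our simulation protocol."""
    x = x0
    for _ in range(burn_in):      # burn-in: states thrown away
        x = step(x)
    samples = []
    for _ in range(n_samples):    # thinning: keep every thin-th state
        for _ in range(thin):
            x = step(x)
        samples.append(x)
    return samples

# deterministic toy kernel just to show the bookkeeping:
# after a burn-in of 2 the state is 2, then every 3rd state is kept
assert run_chain(lambda x: x + 1, 0, 3, burn_in=2, thin=3) == [5, 8, 11]
```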
Figure 5 presents the ACF for the diameter of subtrees of the same size sampled from graphs with different vertex degrees. When sampling from a 4-regular graph, the chain clearly converged quickly, unlike the case where we increased the degree to r = 40, for which the chain took more time to converge. This result is consistent with the theoretical part, which proved the effect of the vertex degree r of an r-regular graph on the convergence speed of the independent method and established that the convergence speed is at most O(n r^{k²/2}). A simulation was also performed on a complete graph, which is the worst case for the independent method; in that case only star subgraphs were produced.
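The ACF used for this diagnostic is the standard empirical autocorrelation of a chain statistic; a minimal sketch (here fed an artificial periodic series, whereas in the experiments the statistic is the sampled subtree diameter):

```python
def acf(series, max_lag):
    """Empirical autocorrelation of a sequence at lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    out = []
    for lag in range(max_lag + 1):
        cov = sum((series[t] - mean) * (series[t + lag] - mean)
                  for t in range(n - lag)) / n
        out.append(cov / var)
    return out

r = acf([1, 2, 3, 4] * 25, 4)
assert abs(r[0] - 1.0) < 1e-12   # lag 0 is always 1
assert r[4] > 0.9                # strong correlation at the period
```

A fast decay of the ACF towards 0 indicates good mixing; values staying high across lags, as for the r = 40 chain, indicate slow convergence.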

Sampling from a barbell graph variant
This part was designed to study the effect of the graph's bottleneck on the efficiency of the presented sampling methods. Several cases were considered in order to monitor the effect of the bottleneck on the mixing time. More precisely, to construct the graph we connected the two grids by only one edge, then 17 edges, then 34 edges, with each grid consisting of 174 × 174 vertices. Although the ACF of the diameter presented in Figure 6 seems to converge when sampling with the crawling method, in reality it does not. We reached this conclusion after reviewing the sampled subtrees: all subtrees were sampled from the first grid only, which means that the chain never moved to the second grid, so the resulting sample does not represent the whole set of graph subtrees of size k. The independent method, in contrast, successfully sampled subtrees from this type of graph, although the chain clearly took more time to converge as the number of edges between the two grids increased. The reason for this behaviour is that the structure of the graph is more complicated for the independent method, and to reach convergence the chain needs to run for more iterations.

Conclusion
We presented two Markov chains, where the convergence speed is O(s^{-1} n r^{k²/2}) for the independent chain, whereas it is O(λ_1^{-1} a² d_max² k⁵) for the non-independent method, which can be very large when the bottleneck is narrow. In the bound on the total variation distance for the crawling method, different parameters appear: λ_1, the second smallest eigenvalue, which can be computed on the graph G(V, E), and a, the maximum number of subtrees associated to a vertex. The parameter a is more difficult (often impossible in practice) to compute. Indeed, for an r-regular graph it is clear that, for a tree of size k, a ≤ r^k, which can be large. So, for a general random graph where a is known to be not too large, the crawling method can be better. Our simulations confirmed the theoretical results. The independent method showed better performance only when sampling subtrees from a graph with a narrow bottleneck, whereas the crawling method was the best when sampling from the other types of graphs. As a result, we recommend using a combination of both methods for sampling uniform subtrees, i.e. sampling trees in two steps: first sample subtrees globally from the graph using the independent method, and then sample locally using the crawling method.
K = max_e (1/Q(e)) Σ_{γ_xy ∋ e} |γ_xy| π(x)π(y), where |γ_xy| stands for the length of the path γ_xy and Q(e) = π(e_1)P(e_1, e_2) if e is the edge (e_1, e_2). K can be bounded in the following way: K ≤ max_{e_1, e_2} [1/(π(e_1)P(e_1, e_2))] Σ_{γ_xy ∋ e} π(x)π(y) D′, where D′ is the diameter of G′(V′, E′).

Figure 5: The ACF plot for the diameter of subtrees sampled by the independent method from two graphs with different vertex degrees

Figure 8: The ACF plot for the number of leaves of subtrees sampled using both methods