Learning Clusters through Information Diffusion

When information or infectious diseases spread over a network, in many practical cases one can observe when nodes adopt information or become infected, but the underlying network is hidden. In this paper, we analyze the problem of finding communities of highly interconnected nodes, given only the infection times of nodes. We propose, analyze, and empirically compare several algorithms for this task. The most stable performance, which improves on the current state of the art, is obtained by our proposed heuristic approaches, which are agnostic to a particular graph structure and epidemic model.


INTRODUCTION
Diffusion processes in networks include the spreading of infectious diseases [14], the spread of computer viruses [36], the promotion of products via viral marketing [16], the propagation of information [31], etc. In many practical scenarios one can observe when nodes become infected but the underlying network is hidden. For instance, during an epidemic, a person becomes ill but cannot tell who infected her; in viral marketing, we observe when customers buy products but not who influenced their decisions; on Twitter, for each retweet only information about the root node of the cascade is available. The problem of hidden network inference has received a lot of attention recently [7,27,29,30]. While these papers aim to recover the actual network connections, in many applications only some global properties of the underlying network are important. For example, in viral marketing, one may wish to find the most influential users, while for recommendation systems one may look for groups of users with similar preferences.
In this paper, we analyze the problem, which recently attracted interest in the literature [5,26], of inferring the community structure of a given network based solely on cascades (e.g., information or epidemic) propagating through this network. Communities are groups of highly interconnected nodes with relatively few edges joining nodes of different groups [9], such as groups by interests in social networks or pages on related topics on the Web. Discovering communities is one of the most prominent tasks in network analysis. Our setting differs from traditional community detection because the network itself is not available to us: we have only cascade data observed on this network. For each cascade, we observe only infected nodes and their infection timestamps.
We propose and analyze several algorithms to solve this problem and compare them to the state-of-the-art. The contributions of this paper are the following. First, we give a systematic treatment to the new problem of inferring community structure based on cascade data, as stated formally in Section 3. Second, we propose two types of approaches for this task: based on likelihood maximization under specific model assumptions, and based on clustering of a surrogate network (Section 4). Third, we conduct extensive experiments on a variety of real-world networks (see Section 5). We conclude that the most stable performance is obtained by our proposed heuristics that are agnostic to a particular graph structure and epidemic model. These heuristics work equally well on different networks, for epidemics of different types.

RELATED WORK
2.1 Network Inference from Cascades
A series of recent papers addressed the following task: by observing infection (activation) times of nodes in several cascades, infer the edges of the underlying network. The NetInf algorithm developed in [11] maximizes the likelihood of the observed cascades under a specific epidemic model. To make the optimization feasible, for each cascade NetInf considers only the most likely propagation tree. This was later improved by MultiTree [30], which includes all directed trees in the optimization and performs better if the number of cascades is small. The NetRate algorithm [28] infers not only edges but also infection rates. NetRate builds on an epidemic model that is more tractable for theoretical analysis (we describe this model in Section 3.2). For this model the likelihood optimization problem turns out to be convex. The ConNIe algorithm [21] also uses convex optimization, with the goal of inferring transmission probabilities for all edges. In [13,29], it is additionally assumed that the underlying network is not static, and the proposed algorithm InfoPath provides on-line estimates of the structure and temporal dynamics of the hidden network. KernelCascade [8] also extends NetRate; here the authors avoid assumptions about a particular parametric form of the influence function. The DANI algorithm [27] is interesting because it explicitly accounts for the community structure to enhance the inference of the network's edges. There are other network inference algorithms not covered here; see, e.g., [7,12,22,32,34,35,37,38,40].

2.2 Community Inference from Cascades
To the best of our knowledge, the paper [4], extended in [5], was the first to address the problem of community detection given cascades. In [4,5], the cascade model includes the influence of individual nodes, and a membership level of a node in each community is inferred using the maximum likelihood approach. The authors propose two algorithms: C-IC takes into account only the participation of a node in a cascade; C-Rate includes the time stamps, but limits a node's influence to its own community. Recently, [26] proposed an alternative maximum likelihood approach, which exploits the Markov property of the cascades. As an input, similarity scores of node pairs are computed, based on their joint participation in cascades. The R-CoDi algorithm in [26] starts with a random partition, while D-CoDi starts with a partition obtained by DANI [27]. We use all four mentioned algorithms as our baselines.

PROBLEM SETUP
3.1 General Setup
Cascades. We observe a set of cascades $\mathcal{C} = \{C_1, \ldots, C_r\}$ that propagate on a latent undirected network $G = (V, E)$ with $|V| = n$ nodes and $|E| = m$ edges. Each cascade $C \in \mathcal{C}$ is a record of observed node activation times, i.e., $C = \{(v_i, t^C_{v_i})\}_{i=1}^{n_C}$, where $v_i$ is a node, $t^C_{v_i}$ is its activation time in $C$, and $|C| = n_C$ is the size of the cascade. Note that we do not observe who infected whom.
Communities. We assume that $G$ is partitioned into communities: $\mathcal{A} = \{A_1, \ldots, A_k\}$, where the sets $A_i$ are pairwise disjoint and their union is $V$. We expect to observe a high intra-community density of edges compared to the inter-community density. In our experiments, the ground truth partitions $\mathcal{A}$ are available for all datasets (see Section 5.1.1). By observing only a set of cascades $\mathcal{C}$ we want to find a partition $\mathcal{A}'$ similar to $\mathcal{A}$.

3.2 Cascade Models
3.2.1 SIR model. The main model for our experiments is the well-known SIR (Susceptible-Infected-Recovered) model [15]. Each node in the network can be in one of three states: susceptible, infected, or recovered. An infected node infects its susceptible neighbors with rate $\alpha$; the infection spreads simultaneously and independently along all edges. An infected node recovers with rate $\beta$ and then stops spreading the infection in the network. For each cascade $C$ we sample its own infection rate $\alpha_C$ from the Lomax (shifted Pareto) distribution in order to model a variety of cascades: there can be minor or widely circulated news, small-scale epidemics or pandemics, etc. The source node of a cascade is chosen uniformly at random.
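As an illustration, the SIR process described above can be simulated with an event queue. This is a minimal sketch rather than the implementation used in the experiments: the names `sample_lomax` and `simulate_sir`, the adjacency-dictionary input, and the Lomax shape and scale are assumptions introduced here, since the text does not specify them.

```python
import heapq
import random


def sample_lomax(shape, scale, rng):
    """Draw from a Lomax (shifted Pareto) distribution (parameters are assumptions)."""
    # random.paretovariate(shape) is Pareto with minimum 1; shifting by 1 gives Lomax.
    return scale * (rng.paretovariate(shape) - 1.0)


def simulate_sir(adj, alpha, beta, rng, source=None):
    """Simulate one SIR cascade on an undirected graph.

    adj maps each node to an iterable of its neighbours. Returns a dict
    {node: infection_time} containing only the nodes that became infected.
    """
    if source is None:
        source = rng.choice(sorted(adj))  # source chosen uniformly at random
    if alpha <= 0.0:
        return {source: 0.0}  # no transmission is possible
    infected_at = {}
    events = [(0.0, source)]  # (candidate infection time, node)
    while events:
        t, v = heapq.heappop(events)
        if v in infected_at:
            continue  # already infected earlier through another edge
        infected_at[v] = t
        recovery = t + rng.expovariate(beta)  # v stops spreading at this time
        for u in adj[v]:
            if u in infected_at:
                continue
            t_u = t + rng.expovariate(alpha)  # independent delay along edge (v, u)
            if t_u < recovery:
                heapq.heappush(events, (t_u, u))
    return infected_at
```

For example, `simulate_sir(adj, sample_lomax(2.0, 1.0, rng), 1.0, rng)` would draw one cascade with its own rate $\alpha_C$ and $\beta = 1$; the shape 2.0 and scale 1.0 are placeholders, not values from the experiments.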

3.2.2 SI model with bounded duration (SI-BD).
In some cases, the SIR model might not be tractable for theoretical analysis, so we also consider a simpler diffusion model introduced in [28]. In this model, again, an activated node infects its neighbors after an exponentially distributed time with intensity $\alpha$, but there is no recovery rate. Instead, all nodes recover simultaneously at some threshold time $T_{\max}$ and the epidemic stops. For simplicity we further assume $T_{\max}$ to be fixed, but our methods allow varying $T_{\max}$ for different epidemics.

3.2.3 Community-based SI-BD model (C-SI-BD).
Another model which allows for a simpler theoretical analysis is based on the setting from [33]. It is assumed that the spreading does not occur over the edges of a graph $G$ but depends solely on the community structure $\mathcal{A}$. As before, the first node of a cascade is chosen uniformly at random. Each activated node can infect all other susceptible nodes independently after an exponentially distributed time. If a susceptible node belongs to the same community, then the infection rate is $\alpha_{\text{in}}$, otherwise it is $\alpha_{\text{out}}$, with $\alpha_{\text{out}} < \alpha_{\text{in}}$. The epidemic stops at time $T_{\max}$.
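The community-based process can be sketched in the same event-driven style. The function name `simulate_c_si_bd` and the dictionary encoding of the partition are assumptions made for this illustration; the sketch simply draws one exponential delay per ordered pair of nodes and stops at the threshold time.

```python
import heapq
import random


def simulate_c_si_bd(community, alpha_in, alpha_out, t_max, rng, source=None):
    """Simulate one C-SI-BD cascade.

    community maps node -> community label. Returns {node: activation_time}
    for all nodes activated strictly before t_max.
    """
    nodes = sorted(community)
    if source is None:
        source = rng.choice(nodes)  # first node chosen uniformly at random
    times = {}
    events = [(0.0, source)]
    while events:
        t, v = heapq.heappop(events)
        if v in times:
            continue
        if t >= t_max:
            break  # every remaining event is also past the stopping time
        times[v] = t
        for u in nodes:
            if u in times:
                continue
            # The rate depends only on whether u and v share a community.
            rate = alpha_in if community[u] == community[v] else alpha_out
            if rate > 0.0:
                heapq.heappush(events, (t + rng.expovariate(rate), u))
    return times
```

Because every activated node may infect every susceptible node, one cascade costs on the order of $n$ delay samples per activated node, which is acceptable for generating synthetic data on small graphs.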

ALGORITHMS
4.1 Background on Likelihood Maximization
Recall that $t^C_i$ denotes the activation time of node $i$ in cascade $C$; we often omit the index $C$ when the context is clear. Without loss of generality, for the first node of a cascade, we set $t_i = 0$. Finally, if a node $i$ is not infected during an epidemic, then we set $t_i = T_{\max}$; we also write $\Delta_{i,j} = |t_i - t_j|$. The log-likelihood $\log L(C)$ of the cascade $C$ for SI-BD with varying infection rates ($i$ infects a susceptible neighbor $j$ after an exponentially distributed time with rate $\alpha_{i,j}$) is given in [28]. For our purposes, it is convenient to write this expression as:

$$\log L(C) = -\sum_{i,j:\, t_i < t_j} \alpha_{i,j}\, \Delta_{i,j} + \sum_{j:\, 0 < t_j < T_{\max}} \log \sum_{i:\, t_i < t_j} \alpha_{i,j}. \tag{1}$$

The log-likelihood for all cascades is $\log L(\mathcal{C}) = \sum_{C \in \mathcal{C}} \log L(C)$. We will next introduce two methods based on maximizing $\log L(\mathcal{C})$.
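To make the likelihood concrete, the sketch below evaluates the SI-BD log-likelihood of one cascade for arbitrary pairwise rates. It assumes the convention that uninfected nodes are assigned the activation time $T_{\max}$, so that every node contributes a survival term for each earlier-activated node and every infected non-source node contributes the logarithm of its total hazard. The function name `log_likelihood` and the callable `rate(i, j)` are hypothetical.

```python
import math


def log_likelihood(cascade, rate, t_max, nodes):
    """Log-likelihood of one SI-BD cascade with pairwise infection rates.

    cascade: {node: activation_time} for infected nodes (the source has time 0).
    rate(i, j): infection rate from i to j (0 if i cannot infect j).
    nodes: all nodes of the network, including uninfected ones.
    """
    # Assumed convention: uninfected nodes get activation time t_max.
    t = {v: cascade.get(v, t_max) for v in nodes}
    total = 0.0
    for j in nodes:
        hazard = 0.0
        for i in cascade:
            if t[i] < t[j]:
                a = rate(i, j)
                total -= a * (t[j] - t[i])  # j stayed susceptible to i this long
                hazard += a
        if 0.0 < t[j] < t_max:  # infected node other than the source
            if hazard <= 0.0:
                return -math.inf  # no earlier node could have infected j
            total += math.log(hazard)
    return total
```

Summing this value over all cascades gives the overall objective; the double loop is quadratic in the number of nodes and is meant for clarity, not speed.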

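4.2 ClustOpt
Consider the C-SI-BD model described in Section 3.2.3. Denote by $a(i)$ the community assignment of node $i$. Then C-SI-BD is equivalent to the model used in [28] with $\alpha_{i,j} = \alpha_{\text{in}}$ if $a(i) = a(j)$ and $\alpha_{i,j} = \alpha_{\text{out}}$ otherwise. The log-likelihood in (1) becomes:

$$\log L(\mathcal{C}, \mathcal{A}) = \sum_{C \in \mathcal{C}} \Biggl[ -\alpha_{\text{in}} \sum_{\substack{i,j:\, t_i < t_j \\ a(i) = a(j)}} \Delta_{i,j} - \alpha_{\text{out}} \sum_{\substack{i,j:\, t_i < t_j \\ a(i) \neq a(j)}} \Delta_{i,j} + \sum_{j:\, 0 < t_j < T_{\max}} \log\Bigl( \alpha_{\text{in}}\, \bigl|\{i : t_i < t_j,\ a(i) = a(j)\}\bigr| + \alpha_{\text{out}}\, \bigl|\{i : t_i < t_j,\ a(i) \neq a(j)\}\bigr| \Bigr) \Biggr], \tag{2}$$

with $\Delta_{i,j} = |t_i - t_j|$.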
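We propose the following algorithm to maximize (2). Algorithm 1 (ClustOpt): (1) Choose some initial partition $\mathcal{A}$; (2) For the current $\mathcal{A}$, find $\alpha_{\text{in}}$ and $\alpha_{\text{out}}$ that maximize (2); (3) For the current $\alpha_{\text{in}}$ and $\alpha_{\text{out}}$, find $\mathcal{A}$ that maximizes (2); (4) Iterate (2)-(3) until convergence.
Let us now discuss how the steps (1)-(3) are implemented.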
Step (1): We have noticed that a stable and fast approach is to start from some reasonable initial partition and optimize over $\alpha_{\text{in}}$, $\alpha_{\text{out}}$ and $\mathcal{A}$ only once, thus avoiding iterations of the costly optimization steps (2) and (3). In the current paper we propose to start from Clique(0), explained in Section 4.4.
Step (2): Without loss of generality, we set $\alpha_{\text{in}} = (\delta + 1)\alpha_{\text{out}}$. Substituting this into (2), we find the optimal $\alpha_{\text{out}}$ in terms of $\delta$ and $\mathcal{A}$:

$$\alpha_{\text{out}} = \frac{\sum_{C \in \mathcal{C}} (|C| - 1)}{\sum_{C \in \mathcal{C}} \Bigl[ (\delta + 1) \sum_{\substack{i,j:\, t_i < t_j \\ a(i) = a(j)}} \Delta_{i,j} + \sum_{\substack{i,j:\, t_i < t_j \\ a(i) \neq a(j)}} \Delta_{i,j} \Bigr]}. \tag{3}$$

Unfortunately, due to the summation of logarithms in (2), we cannot find another simple analytical relation between the optimal values of $\delta$ and $\alpha_{\text{out}}$. Hence, we resort to a numerical solution. Due to (3), it is sufficient to numerically find only the optimal $\delta$; this gives the optimal values for both $\alpha_{\text{in}}$ and $\alpha_{\text{out}}$.
Step (3): We follow [25], where the Louvain algorithm [6] is modified for optimizing a wide range of functions that depend on community structure, besides the original modularity measure. We adapt this algorithm for the likelihood given in (2) by computing the gain in $\log L(\mathcal{C}, \mathcal{A})$ obtained by moving a node $v$ from one community to another. To limit computational complexity, we consider only moving single nodes from one community to another and do not attempt to move groups of nodes or merge communities.
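Step (2) can be sketched as a one-dimensional search over $\delta$. The sketch assumes that, for a fixed partition, the data enter only through three summaries: the total time gaps over ordered pairs inside communities (`d_in`) and across communities (`d_out`), and, for each infected non-source node, the numbers of earlier-activated nodes in the same and in other communities (`counts`). For each $\delta$, the optimal $\alpha_{\text{out}}$ is the number of infected non-source nodes divided by $(\delta + 1)\,$`d_in` $+$ `d_out`. These names and the plain grid search are illustrative choices; the text does not say which numerical method is used.

```python
import math


def profile_log_likelihood(delta, d_in, d_out, counts):
    """Log-likelihood for a given delta after plugging in the optimal alpha_out.

    d_in / d_out: summed time gaps over ordered pairs inside / across communities.
    counts: one (n_in, n_out) pair per infected non-source node.
    Returns (log-likelihood value, optimal alpha_out).
    """
    n = len(counts)
    exposure = (delta + 1.0) * d_in + d_out
    alpha_out = n / exposure  # closed-form optimum for this delta
    value = -alpha_out * exposure + n * math.log(alpha_out)
    value += sum(math.log((delta + 1.0) * n_in + n_out) for n_in, n_out in counts)
    return value, alpha_out


def fit_rates(d_in, d_out, counts, delta_grid):
    """Grid search over delta; returns (alpha_in, alpha_out, delta)."""
    best = max(delta_grid,
               key=lambda d: profile_log_likelihood(d, d_in, d_out, counts)[0])
    alpha_out = profile_log_likelihood(best, d_in, d_out, counts)[1]
    return (best + 1.0) * alpha_out, alpha_out, best
```

A grid is used because the profile likelihood is not obviously unimodal in $\delta$; a bracketing optimizer could replace it if that property were established.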

4.3 GraphOpt
In this section, we drop the assumption of fixed infection rates within and between communities. Instead, we assume the SI-BD cascade model defined in Section 3.2.2. Our task is much more complex in this setting because now we have a hidden graph G.
We propose an expectation-maximization-based method, where $G$ is a latent variable. We assume the following generative probabilistic process. For a given partition $\mathcal{A}$ of $n$ nodes we construct a graph $G$ according to some model (distribution) $P(G \mid \mathcal{A})$. Then, based on $G$, we generate a set of cascades $\mathcal{C}$ according to another model $P(\mathcal{C} \mid G)$. We observe $\mathcal{C}$ and our aim is to recover the partition $\mathcal{A}$ which maximizes the likelihood, i.e., $\bar{\mathcal{A}} = \arg\max_{\mathcal{A}} P(\mathcal{C} \mid \mathcal{A})$. Let us first explain how the standard expectation-maximization (EM) approach applies to this problem: (1) Choose some initial partition $\hat{\mathcal{A}}$; (2) Find the distribution $P(G \mid \mathcal{C}, \hat{\mathcal{A}})$; (3) Update $\hat{\mathcal{A}}$: $\hat{\mathcal{A}} = \arg\max_{\mathcal{A}} \mathbb{E}_G \log P(\mathcal{A}, G \mid \mathcal{C})$, where the expectation is taken over $G$ distributed as in step (2); (4) Iterate (2)-(3) until convergence.
This algorithm cannot be applied directly because, as we explain below, steps (2) and (3) are computationally intractable. Our algorithm GraphOpt is obtained by simplifying these two steps.
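First, let us rewrite the distribution in step (2):

$$P(G \mid \mathcal{C}, \hat{\mathcal{A}}) = \frac{P(\mathcal{C} \mid G, \hat{\mathcal{A}})\, P(G \mid \hat{\mathcal{A}})}{P(\mathcal{C} \mid \hat{\mathcal{A}})} \propto P(\mathcal{C} \mid G)\, P(G \mid \hat{\mathcal{A}}), \tag{4}$$

where we used that $P(\mathcal{C} \mid \hat{\mathcal{A}})$ does not depend on $G$ and $P(\mathcal{C} \mid G, \hat{\mathcal{A}}) = P(\mathcal{C} \mid G)$ since, given the graph, the division into communities does not affect the cascades. Now, we rewrite the probability in step (3):

$$P(\mathcal{A}, G \mid \mathcal{C}) = \frac{P(\mathcal{C} \mid G)\, P(G \mid \mathcal{A})\, P(\mathcal{A})}{P(\mathcal{C})}.$$

Since we do not have any prior assumptions for $\mathcal{A}$, we assume that $P(\mathcal{A})$ is constant. Also, $P(\mathcal{C})$ and $P(\mathcal{C} \mid G)$ do not depend on $\mathcal{A}$, so

$$\arg\max_{\mathcal{A}} \mathbb{E}_G \log P(\mathcal{A}, G \mid \mathcal{C}) = \arg\max_{\mathcal{A}} \mathbb{E}_G \log P(G \mid \mathcal{A}). \tag{5}$$

From (4) and (5) we see the computational bottlenecks of the EM algorithm. Indeed, even if for each $G$ we can estimate the probability in (4), it is still impossible to even compute $\mathbb{E}_G \log P(G \mid \mathcal{A})$, because the expectation is over all possible realizations of $G$. Clearly, we cannot hope to maximize this expression over $\mathcal{A}$. Therefore, we simplify the procedure: instead of computing the entire probability distribution over all graphs $G$ in (4), we find only the most likely graph $\hat{G}$, with the idea that $\hat{G}$ gives the largest contribution to $\mathbb{E}_G \log P(G \mid \mathcal{A})$. Then the optimization problem in (5) is solved only for $\hat{G}$. As a result, we obtain the following algorithm. Algorithm 2 (GraphOpt): (1) Choose some initial partition $\hat{\mathcal{A}}$ and initial graph $\hat{G}$; (2) Find $\hat{G} = \arg\max_G \bigl[\log P(G \mid \hat{\mathcal{A}}) + \log P(\mathcal{C} \mid G)\bigr]$; (3) Update $\hat{\mathcal{A}} = \arg\max_{\mathcal{A}} \log P(\hat{G} \mid \mathcal{A})$; (4) Iterate (2)-(3) until convergence.
Let us describe how each step of Algorithm 2 is implemented.
Step (1): We start from a trivial partition $\hat{\mathcal{A}}$ in which each node belongs to its own cluster. We also create an initial graph $\hat{G} = (V, E_0)$ as follows: for each cascade $C$ consider the path of $|C| - 1$ edges, where each edge connects two nodes that are activated consecutively in this cascade; $E_0$ is the union of such paths over all $C \in \mathcal{C}$.
Step (2): $P(G \mid \hat{\mathcal{A}})$ is defined by the ILFR model proposed in [25], where the formula for $\log P(G \mid \hat{\mathcal{A}})$ is also provided. This model was shown to give the best fit to a variety of real-world networks in terms of likelihood. For $P(\mathcal{C} \mid G)$, we assume the SI-BD cascade model, so we use (1) with $\alpha_{i,j} = \alpha$ if $i$ and $j$ are connected in $G$ and $\alpha_{i,j} = 0$ otherwise. Then $\log P(\mathcal{C} \mid G) = \sum_{C \in \mathcal{C}} \log P(C \mid G)$ with

$$\log P(C \mid G) = -\alpha \sum_{\substack{\{i,j\} \in E(G) \\ t_i < t_j}} \Delta_{i,j} + \sum_{j:\, 0 < t_j < T_{\max}} \log\Bigl( \alpha\, \bigl|\{i : \{i,j\} \in E(G),\ t_i < t_j\}\bigr| \Bigr). \tag{6}$$

A nice feature of (6) is that for given $\mathcal{C}$ and $G$ it is easy to compute $\alpha$ that maximizes the overall likelihood:

$$\alpha = \frac{\sum_{C \in \mathcal{C}} (|C| - 1)}{\sum_{C \in \mathcal{C}} \sum_{\{i,j\} \in E(G),\ t_i < t_j} \Delta_{i,j}}.$$

Finally, we need to find $\hat{G} = \arg\max_G \bigl[\log P(G \mid \hat{\mathcal{A}}) + \log P(\mathcal{C} \mid G)\bigr]$. We use a greedy approach to approximately find $\hat{G}$: we iteratively add and remove edges to increase $\log P(G \mid \hat{\mathcal{A}}) + \log P(\mathcal{C} \mid G)$ (the detailed description is omitted due to space constraints).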
Step (3): This is a standard likelihood optimization task used in community detection. We use the Louvain-based algorithm proposed in [25].
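The cascade term used in Step (2) can be evaluated for a candidate graph as sketched below. The sketch assumes a single shared rate $\alpha$ on the edges of $G$ and the convention that uninfected nodes have activation time $T_{\max}$; under these assumptions the maximizing $\alpha$ is the number of infected non-source nodes divided by the summed time gaps along edges. The function name `fit_alpha_on_graph` is hypothetical, and the greedy edge search and the ILFR term are not shown.

```python
import math


def fit_alpha_on_graph(cascades, edges, t_max, nodes):
    """Return (alpha_hat, log P(cascades | G)) for one shared rate alpha.

    cascades: list of {node: activation_time}; edges: iterable of node pairs.
    """
    nbrs = {v: set() for v in nodes}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    exposure = 0.0     # summed time gaps over edges {i, j} with t_i < t_j
    log_parents = 0.0  # summed log(number of earlier-activated neighbours)
    infections = 0     # infected non-source nodes over all cascades
    for c in cascades:
        t = {v: c.get(v, t_max) for v in nodes}  # uninfected nodes get t_max
        for j in nodes:
            earlier = [i for i in nbrs[j] if t[i] < t[j]]
            exposure += sum(t[j] - t[i] for i in earlier)
            if 0.0 < t[j] < t_max:
                if not earlier:
                    return 0.0, -math.inf  # G cannot explain this infection
                infections += 1
                log_parents += math.log(len(earlier))
    if infections == 0:
        return 0.0, 0.0  # alpha = 0 maximizes the likelihood of no spreading
    alpha_hat = infections / exposure
    log_lik = -alpha_hat * exposure + infections * math.log(alpha_hat) + log_parents
    return alpha_hat, log_lik
```

A greedy search over graphs could call this function after each tentative edge change, although recomputing everything from scratch would be slow; an incremental update of the three running sums is the natural refinement.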

4.4 Clustering of Surrogate Graphs
In this section, we present simple yet effective methods that construct a surrogate graph $\hat{G}$ and then cluster this graph (using the Louvain algorithm [6]). Crucially, $\hat{G}$ does not need to be similar to $G$; it just needs to capture the community structure on an aggregated level. Our experiments show that clustering of $\hat{G}$ often performs better than first inferring $G$ and then clustering it.
4.4.1 Path algorithm. Assume again that cascades $\mathcal{C}$ are generated by the SI-BD model. Now, for $C \in \mathcal{C}$ let $G' = G'(C)$ be a graph with exactly $|C| - 1$ edges that maximizes the probability $P(C \mid G')$ in (6). It is easy to check that $G'$ is merely a path connecting consecutively activated nodes. Note that we have already used such paths to initialize Algorithm 2. This leads to the Path algorithm: the surrogate graph $\hat{G}$ is built from the paths $G'(C)$ over all $C \in \mathcal{C}$ and is then clustered.
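The path construction can be sketched as follows. Counting how many cascades use each edge and keeping that count as an edge weight is an assumption of this sketch, as is the function name `path_surrogate`; the clustering step itself (Louvain) is not shown.

```python
from collections import Counter


def path_surrogate(cascades):
    """Build a surrogate graph from paths of consecutively activated nodes.

    cascades: list of {node: activation_time}. Returns a Counter mapping each
    undirected edge (u, v), stored with u < v, to the number of paths using it.
    """
    weights = Counter()
    for c in cascades:
        order = sorted(c, key=c.get)  # nodes ordered by activation time
        for u, v in zip(order, order[1:]):
            weights[(min(u, v), max(u, v))] += 1
    return weights
```

For instance, the cascades `{0: 0.0, 2: 0.3, 1: 0.7}` and `{2: 0.0, 0: 0.4}` yield the edge (0, 2) with weight 2 and the edge (1, 2) with weight 1.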

4.4.2 Clique.
Another approach is to include all possible edges that could participate in the cascade, weighing them by a proxy of their likelihood. Then each cascade $C$ results in a weighted clique of size $|C|$. The weights can be chosen, for example, as follows. For a cascade $C$ and two nodes $i, j$ let us consider the probability $P_C(i, j) = P(j \text{ was infected from } i \mid C)$. If $t_i > t_j$, then, obviously, $P_C(i, j) = 0$. If $t_i < t_j$, then, as in [11], we assume that $P_C(i, j)$ decreases exponentially with $\Delta_{i,j}$; namely, $P_C(i, j) = c_j e^{-a \Delta_{i,j}}$, where $c_j$ is a constant depending on $j$. Since $j$ was infected from exactly one previous node, we must have $\sum_{i:\, t_i < t_j} P_C(i, j) = 1$. Therefore,

$$P_C(i, j) = \frac{e^{-a \Delta_{i,j}}}{\sum_{l:\, t_l < t_j} e^{-a \Delta_{l,j}}}.$$

Clique constructs $\hat{G}$ as the weighted graph with the weight of $\{i, j\}$ given by $\sum_{C \in \mathcal{C}} \bigl(P_C(i, j) + P_C(j, i)\bigr)$, which, under our assumptions, is the expected number of times infections passed between $i$ and $j$. Note that we directly model $P_C(i, j)$ rather than making assumptions about infection times. The parameter $a$ essentially balances between paths and cliques: if $a$ is large, then we mostly take into account consecutive nodes with small $\Delta_{i,j}$; for small $a$ all pairs of nodes that participated in $C$ are important. To make Clique insensitive to the speed of epidemics, we take $a = 1/\Delta$, where $\Delta$ is the average time between infection times (we average over all pairs of infected nodes belonging to the same cascade). We also consider Clique(0) with $a = 0$, which is a natural choice for the SI-BD model because it mimics the memoryless property of exponential infection times.
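The weighting scheme above translates almost directly into code. The following sketch uses hypothetical names (`mean_gap`, `clique_surrogate`) and assumes that activation times within a cascade are distinct, so that "earlier" is well defined by sorting.

```python
import math
from collections import defaultdict
from itertools import combinations


def mean_gap(cascades):
    """Average |t_i - t_j| over all pairs of infected nodes in the same cascade."""
    gaps = [abs(c[i] - c[j]) for c in cascades for i, j in combinations(c, 2)]
    return sum(gaps) / len(gaps)


def clique_surrogate(cascades, a):
    """Edge weights sum_C (P_C(i, j) + P_C(j, i)) with P_C(i, j) ~ exp(-a * gap)."""
    weights = defaultdict(float)
    for c in cascades:
        order = sorted(c, key=c.get)  # nodes ordered by activation time
        for pos in range(1, len(order)):
            j = order[pos]
            earlier = order[:pos]  # possible infectors of j
            scores = [math.exp(-a * (c[j] - c[i])) for i in earlier]
            z = sum(scores)  # normalizer: j has exactly one infector
            for i, s in zip(earlier, scores):
                # Only the earlier node can infect the later one, so one of the
                # two terms P_C(i, j), P_C(j, i) is always zero.
                weights[(min(i, j), max(i, j))] += s / z
    return dict(weights)
```

Calling `clique_surrogate(cascades, 0.0)` corresponds to the $a = 0$ variant, and `clique_surrogate(cascades, 1.0 / mean_gap(cascades))` to the time-scaled variant; `mean_gap` requires at least one cascade with two or more nodes.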

Baselines
MultiTree uses the algorithm from [30] to find an inferred graph $\hat{G}$, which is then clustered by the Louvain algorithm [6].
The Oracle algorithm serves as an idealized reference for all network inference algorithms: we construct a graph $\hat{G}$ consisting of all edges that participated in cascades (assuming these edges are known; hence the name Oracle). The obtained graph $\hat{G}$ is clustered by the Louvain algorithm.
Finally, we used algorithms proposed for solving the same problem as in our work: C-IC and C-Rate [5], as well as R-CoDi and D-CoDi [26]. We use the publicly available implementations provided by the authors. For C-IC and C-Rate we use the real number of communities as an input parameter.

Metric.
We evaluate the quality of the obtained communities compared to the ground truth using the Normalized Mutual Information (NMI) similarity measure [2,9]. Let $V_0 \subset V$ denote the set of all nodes that participated in cascades. To compute NMI, we assign all nodes in $V \setminus V_0$ to one cluster labeled "unknown".
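A small sketch of this evaluation is given below. It normalizes mutual information by the arithmetic mean of the two entropies, which is one common convention; the text does not state which normalization is used, so this choice, like the names `nmi` and `evaluate`, is an assumption.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information, 2 * I(A; B) / (H(A) + H(B))."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(k / n * math.log(k / n) for k in count_a.values())
    h_b = -sum(k / n * math.log(k / n) for k in count_b.values())
    mi = sum(k / n * math.log(k * n / (count_a[x] * count_b[y]))
             for (x, y), k in joint.items())
    if h_a + h_b == 0.0:
        return 1.0  # both partitions consist of a single cluster
    return 2.0 * mi / (h_a + h_b)


def evaluate(truth, predicted, all_nodes):
    """Compare partitions; nodes absent from `predicted` go to one 'unknown' cluster.

    truth and predicted map node -> label. Predicted labels are assumed not to
    use the string 'unknown' themselves.
    """
    pred = [predicted.get(v, "unknown") for v in all_nodes]
    true = [truth[v] for v in all_nodes]
    return nmi(true, pred)
```

Putting all unobserved nodes into a single cluster penalizes methods only through the mismatch between that cluster and the true communities of those nodes, which is the same for every algorithm run on the same set of cascades.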

Cascades.
We took the datasets with ground truth communities described in Section 5.1.1 and, in order to cover various possible diffusion processes, we generated the cascades according to all models discussed in Section 3.2. Importantly, there are both edge-based and community-based models.
Without loss of generality, we set $\beta = 1$ for the SIR model and $T_{\max} = 1$ for SI-BD and C-SI-BD. We noticed that to get an informative set of cascades, the parameters $\alpha$, $\alpha_{\text{in}}$, $\alpha_{\text{out}}$, and the parameter of the Lomax distribution used by SIR have to be different for different datasets. For SIR and SI-BD we choose parameters so that the average size of a cascade is 2; for C-SI-BD we take $\alpha_{\text{in}} = 10\alpha_{\text{out}}$ and choose $\alpha_{\text{out}}$ such that the number of cascades consisting of one node is about 20%. We checked that the obtained distribution of cascade sizes is similar to that observed for real data [3,10,12,17,18]. Single-node cascades were removed.

Results
We have run the algorithms on all datasets and varied the number of cascades $|\mathcal{C}|$. Figures 1 and 2 show the results obtained for the SIR and C-SI-BD models, respectively. All results are averaged over 5 samples of generated cascades $\mathcal{C}$. We noticed that the results are quite different for different datasets and cascade models. Nevertheless, the proposed Clique algorithm has the most stable performance, implying that the corresponding surrogate graph is able to capture the information about the community structure. Importantly, both Clique(0) and Clique work equally well for all cascade models since they are not based on a particular one, as discussed in Section 4.4.
Note that Path is usually worse than Clique as it uses less information.
The baselines C-IC and C-Rate are very unstable and in many cases they show the worst results. This is expected, since C-IC and C-Rate are based on a specific community-based model. Indeed, on C-SI-BD, C-IC shows reasonable performance in some cases, especially on Political blogs. Note that C-IC is almost always better than C-Rate (except for some cascade sizes on Karate with the SIR model). D-CoDi and R-CoDi are essentially based on clustering of a surrogate graph, so their performance is stable for all epidemic models. However, in all cases D-CoDi failed on cascades of small sizes (we included only successful points). D-CoDi can be both better and worse than R-CoDi. While performing reasonably in many cases, these two baselines may have unexpectedly bad points: see, e.g., the large number of cascades on Political blogs and Eu-core with all epidemic models. MultiTree turns out to have [...]. GraphOpt shows excellent performance, e.g., on Eu-core for all types of epidemics, and on Football for SIR cascades and for a small number of C-SI-BD cascades. However, in general its quality is unstable, and it is also much slower than surrogate-based methods.
Oracle is considered only as an upper bound for all network inference algorithms. (We do not plot Oracle for C-SI-BD, since in this model the graph is not used by the propagation process; in particular, for large $|\mathcal{C}|$ we may get a complete graph.) If the number of cascades is large enough, then Oracle essentially clusters the original graph. Interestingly, in many cases Oracle is beaten by the surrogate-graph-based algorithms. This basically means that errors made by Louvain on the original graph can be reduced by the clever weighting of edges provided by the surrogate-graph-based algorithms.
To conclude, we propose using the universal Clique method for the problem of community inference based on cascade data. This algorithm is simple, efficient, and works well for all considered cascade models.

ACKNOWLEDGMENTS
This study was funded by RFBR according to the research project 18-31-00207.