Accessing and analyzing a large online social network (OSN) with millions of users, such as a relationship network (e.g., Facebook), requires efficient computational methods. Network size and limited data accessibility (e.g., API limitations), however, make analysis of a relationship network difficult if not impossible. To cope with this problem, we propose an algorithm that, instead of analyzing the large relationship network (represented as a graph), analyzes the dynamic activity graph of online user interactions (e.g., content sharing or tweeting), which reflects the relationship network. To make the computations required for OSN analysis feasible and manageable, the proposed algorithm generates a representative sub-graph of the large relationship network based on the dynamic activity graph. Thus, a complicated analysis of the large relationship graph is reduced to a simpler analysis of the representative sub-graph. Recently published research suggests using graph sampling algorithms to cope with analyses of large graphs. However, these algorithms assume that access to the relationship graph is feasible and, hence, that direct sampling is possible. In this research, instead of assuming that the relationship graph is accessible, our algorithm uses the smaller dynamic activity graph to generate a representative and unbiased sub-graph of the large relationship graph. The datasets used to evaluate the proposed algorithm are based on two Facebook (FB) networks and two Twitter (TW) networks. The first FB network is a friendship relationship network represented by a static directed graph with 63,731 nodes and 1,545,686 edges. In FB, a user can interact with friends by posting comments on their walls; the second FB network is thus an activity network that represents the dynamic wall activity, over 52 months, of users in the first FB network, with 13,478 nodes and 16,624 edges.
The first TW network is a static relationship network with 456,626 nodes and 14,855,842 edges, represented as a directed graph in which each node is a tweet author and each edge represents a follower or followed-by relationship. The second TW network is a dynamic activity network that describes retweet, mention, and reply interactions among users in the first TW network, with a total of 304,691 nodes and 461,192 edges; splitting it into one-hour intervals yielded 168 observations. After defining a set of graph properties, we tested how well the proposed algorithm preserves the following node-level average and distribution statistics, for different parameters and varying sample sizes: the degree distribution, the clustering coefficient, and the path length. Overall, forest fire sampling (FFS) performed best among all algorithms that must access the relationship graph (average D-statistic of 0.27 on TW and 0.29 on FB). Our algorithm, which does not require such access, closely followed FFS on TW (average D-statistic of 0.29) and outperformed it on FB (average D-statistic of 0.23).
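The abstract reports D-statistics without defining them. Assuming the two-sample Kolmogorov-Smirnov statistic is meant (the largest gap between two empirical cumulative distributions), the comparison of degree distributions between a full graph and a sampled sub-graph might be sketched as follows. The function names and the toy edge lists are hypothetical, and edges are treated as undirected for brevity, whereas the relationship graphs above are directed.

```python
from bisect import bisect_right
from collections import defaultdict


def degrees(edges):
    """Return the node degrees of an edge list, treating edges as undirected."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return list(deg.values())


def d_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: the maximum absolute difference
    between the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    # The empirical CDFs only change at observed values, so checking
    # those points is enough to find the maximum gap.
    points = set(a) | set(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


# Hypothetical toy graphs, for illustration only (not the FB or TW data).
full_graph = [(1, 2), (1, 3), (1, 4), (2, 3), (4, 5), (5, 6)]
sub_graph = [(1, 2), (1, 3), (2, 3)]

d = d_statistic(degrees(full_graph), degrees(sub_graph))
print(f"D-statistic for the degree distribution: {d:.3f}")
```

A value of 0 means the two distributions coincide and a value of 1 means they do not overlap at all, so lower averages, as reported above, indicate a sub-graph that preserves the property more faithfully.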
|Original language||English GB|
|State||Published - 28 Jun 2018|
|Event||XXXVIII Sunbelt 2018, Utrecht - Utrecht University, Utrecht, Netherlands|
Duration: 27 Jun 2018 → 1 Jul 2018