TY - JOUR
T1 - Unsupervised discovery of non-trivial similarities between online communities
AU - Israeli, Abraham
AU - Cohen, Shani
AU - Tsur, Oren
N1 - Funding Information:
We thank Dr. Hila Gonen for valuable discussions and advice about this research.
Publisher Copyright:
© 2022 Elsevier Ltd
PY - 2022/11/15
Y1 - 2022/11/15
AB - Language is used differently across communities, and the differences may manifest in vocabulary, style, and semantics. These differences enable the exploration of nuanced similarities and differences between communities. In this work, we introduce C3, a novel unsupervised approach for community comparison. C3 creates contextual pairwise representations by aligning communities and tuning word embeddings according to both the lexical context and the social context reflected by each community's structure and engagement patterns. Specifically, C3 takes into account the semantic relations between pairs of words, as reflected by each community's embedding model, and leverages the social context and users' roles in their community to calculate a similarity measure between community pairs. C3 is evaluated on a dataset of 1565 active Reddit communities and compared against three competitive models. We show through an array of experiments and validations that C3 recovers nuanced, non-trivial similarities between communities that are not captured by any of the competitive models. We complement the quantitative results with a qualitative analysis, discussing recovered non-trivial similarities between community pairs such as opiates and adhd, babyBumps and depression, and wallStreetBets and sandersForPresident, all of which are recovered by C3 but not by any of the other models. This qualitative analysis demonstrates the exploratory power of our model.
KW - Computational social science
KW - Machine learning
KW - Natural language processing
KW - Online communities
KW - Social network analysis
KW - Word embeddings
UR - http://www.scopus.com/inward/record.url?scp=85132912188&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2022.117900
DO - 10.1016/j.eswa.2022.117900
M3 - Article
AN - SCOPUS:85132912188
SN - 0957-4174
VL - 206
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 117900
ER -