TY - JOUR
T1 - Iterative query selection for opaque search engines with pseudo relevance feedback
AU - Reuben, Maor
AU - Elyashar, Aviad
AU - Puzis, Rami
N1 - Funding Information:
The authors would like to thank Robin Levy-Stevenson for editing this article.
Publisher Copyright:
© 2022 Elsevier Ltd
PY - 2022/9/1
Y1 - 2022/9/1
N2 - Retrieving information from an online search engine is the first and most important step in many data mining tasks, such as fake news detection. Most of the search engines currently available on the web, including all social media platforms, are black boxes (i.e., opaque) that support only short keyword queries. In these settings, it is challenging to retrieve all posts and comments discussing a particular news item automatically and on a large scale. In this paper, we propose a method for generating short keyword queries given a prototype document. The proposed iterative query selection (IQS) algorithm interacts with the opaque search engine to iteratively improve the query by maximizing the number of relevant results retrieved. We evaluated IQS on the Twitter TREC Microblog 2012 and TREC-COVID 2019 datasets, where it outperformed state-of-the-art methods. In addition, we used the IQS algorithm to automatically collect a large-scale dataset of about 70K true and fake news items for the fake news detection task. The dataset, which we have made publicly available to the research community, includes over 22M accounts and 61M tweets. We demonstrate the usefulness of the dataset for the fake news detection task, achieving state-of-the-art performance.
AB - Retrieving information from an online search engine is the first and most important step in many data mining tasks, such as fake news detection. Most of the search engines currently available on the web, including all social media platforms, are black boxes (i.e., opaque) that support only short keyword queries. In these settings, it is challenging to retrieve all posts and comments discussing a particular news item automatically and on a large scale. In this paper, we propose a method for generating short keyword queries given a prototype document. The proposed iterative query selection (IQS) algorithm interacts with the opaque search engine to iteratively improve the query by maximizing the number of relevant results retrieved. We evaluated IQS on the Twitter TREC Microblog 2012 and TREC-COVID 2019 datasets, where it outperformed state-of-the-art methods. In addition, we used the IQS algorithm to automatically collect a large-scale dataset of about 70K true and fake news items for the fake news detection task. The dataset, which we have made publicly available to the research community, includes over 22M accounts and 61M tweets. We demonstrate the usefulness of the dataset for the fake news detection task, achieving state-of-the-art performance.
KW - Fake news
KW - Opaque search engine
KW - Pseudo relevance feedback
KW - Query selection
UR - http://www.scopus.com/inward/record.url?scp=85128767984&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2022.117027
DO - 10.1016/j.eswa.2022.117027
M3 - Article
AN - SCOPUS:85128767984
SN - 0957-4174
VL - 201
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 117027
ER -