TY - GEN
T1 - Data-driven Coreference-based Ontology Building
AU - Ashury-Tahan, Shir
AU - Cohen, Amir David Nissan
AU - Cohen, Nadav
AU - Louzoun, Yoram
AU - Goldberg, Yoav
N1 - Publisher Copyright: © 2024 Association for Computational Linguistics.
PY - 2024/1/1
Y1 - 2024/1/1
N2 - While coreference resolution is traditionally used as a component in individual document understanding, in this work we take a more global view and explore what we can learn about a domain from the set of all document-level coreference relations that are present in a large corpus. We derive coreference chains from a corpus of 30 million biomedical abstracts and construct a graph based on the string phrases within these chains, establishing connections between phrases if they co-occur within the same coreference chain. We then use the graph structure and the betweenness centrality measure to distinguish between edges denoting hierarchy, identity and noise, assign directionality to edges denoting hierarchy, and split nodes (strings) that correspond to multiple distinct concepts. The result is a rich, data-driven ontology over concepts in the biomedical domain, parts of which overlap significantly with human-authored ontologies. We release the coreference chains and resulting ontology under a Creative Commons license, along with the code.
AB - While coreference resolution is traditionally used as a component in individual document understanding, in this work we take a more global view and explore what we can learn about a domain from the set of all document-level coreference relations that are present in a large corpus. We derive coreference chains from a corpus of 30 million biomedical abstracts and construct a graph based on the string phrases within these chains, establishing connections between phrases if they co-occur within the same coreference chain. We then use the graph structure and the betweenness centrality measure to distinguish between edges denoting hierarchy, identity and noise, assign directionality to edges denoting hierarchy, and split nodes (strings) that correspond to multiple distinct concepts. The result is a rich, data-driven ontology over concepts in the biomedical domain, parts of which overlap significantly with human-authored ontologies. We release the coreference chains and resulting ontology under a Creative Commons license, along with the code.
UR - https://www.scopus.com/pages/publications/85217620007
U2 - 10.18653/v1/2024.findings-emnlp.834
DO - 10.18653/v1/2024.findings-emnlp.834
M3 - Conference contribution
AN - SCOPUS:85217620007
T3 - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024
SP - 14290
EP - 14300
BT - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024
A2 - Al-Onaizan, Yaser
A2 - Bansal, Mohit
A2 - Chen, Yun-Nung
PB - Association for Computational Linguistics (ACL)
T2 - 2024 Findings of the Association for Computational Linguistics, EMNLP 2024
Y2 - 12 November 2024 through 16 November 2024
ER -