Data efficient molecular image representation learning using foundation models

Yonatan Harnik, Hadas Shalit Peleg, Amit H. Bermano, Anat Milo

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Deep learning (DL) in chemistry has seen significant progress, yet its applicability is limited by the scarcity of large, labeled datasets and the difficulty of extracting meaningful molecular features. Molecular representation learning (MRL) has emerged as a powerful approach to address these challenges by decoupling feature extraction from property prediction. In MRL, a deep learning network is first trained to learn molecular features from large, unlabeled datasets and is then fine-tuned for property prediction on smaller, specialized datasets. Although MRL methods have been widely applied across chemistry, the underlying models are typically trained from scratch. Herein, we propose that foundation models can serve as an advantageous starting point for developing MRL models. Foundation models are large models that are trained on diverse datasets and can address a variety of downstream tasks. For example, large language models like OpenAI's GPT-4 can be fine-tuned with minimal additional data for tasks that differ considerably from those they were trained on. Based on this premise, we used OpenAI's vision foundation model, CLIP, as the backbone for MoleCLIP, a molecular image representation learning framework. MoleCLIP requires significantly less molecular pretraining data to match the performance of state-of-the-art models on standard benchmarks. Furthermore, MoleCLIP outperformed existing models on homogeneous catalysis datasets, emphasizing its robustness to distribution shifts, which allows it to adapt effectively to varied tasks and datasets. This successful application of a general foundation model to chemical tasks highlights the potential of innovations in DL research to advance synthetic chemistry and, more broadly, any field where molecular property description is central to discovery.

Original language: English
Pages (from-to): 10833-10841
Number of pages: 9
Journal: Chemical Science
Volume: 16
Issue number: 24
State: Published - 22 May 2025

ASJC Scopus subject areas

  • General Chemistry
