TY - JOUR
T1 - Learning dataset representation for automatic machine learning algorithm selection
AU - Cohen-Shapira, Noy
AU - Rokach, Lior
N1 - Funding Information:
This study was supported by grants from the National High Level Hospital Clinical Research Funding (Scientific and Technological Achievements Transformation Incubation Guidance Fund Project of Peking University First Hospital, 2022CX04), Capital’s Funds for Health Improvement and Research (2022-2Z-40712), the National Key R&D Program of China (2016YFC0904900), and National Natural Science Foundation of China (81872940, 81973395, and 82073935).
Publisher Copyright:
© 2022, The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature.
PY - 2022/8/4
Y1 - 2022/8/4
N2 - The algorithm selection problem is defined as identifying the best-performing machine learning (ML) algorithm for a given combination of dataset, task, and evaluation measure. The human expertise required to evaluate the growing number of available ML algorithms has created a need to automate the algorithm selection task. Various approaches have emerged to handle the automatic algorithm selection challenge, including meta-learning. Meta-learning is a popular approach that leverages accumulated experience for future learning and typically involves dataset characterization. Existing meta-learning methods often represent a dataset using predefined features and thus cannot be generalized across different ML tasks, or, alternatively, learn a dataset’s representation in a supervised manner and therefore cannot handle unsupervised tasks. In this study, we propose a novel learning-based, task-agnostic method for producing dataset representations. We then introduce TRIO, a meta-learning approach that uses the proposed dataset representations to accurately recommend top-performing algorithms for previously unseen datasets. TRIO first learns graphical representations of the datasets, using four tools to capture the latent interactions among dataset instances, and then applies a graph convolutional neural network to extract embedding representations from the resulting graphs. We extensively evaluate the effectiveness of our approach on 337 datasets and 195 ML algorithms, demonstrating that TRIO significantly outperforms state-of-the-art algorithm selection methods on both supervised (classification and regression) and unsupervised (clustering) tasks.
KW - Algorithm selection
KW - AutoML
KW - Meta-learning
KW - Task-agnostic dataset representation
UR - http://www.scopus.com/inward/record.url?scp=85135401097&partnerID=8YFLogxK
U2 - 10.1007/s10115-022-01716-2
DO - 10.1007/s10115-022-01716-2
M3 - Article
AN - SCOPUS:85135401097
SN - 0219-1377
VL - 64
SP - 2599
EP - 2635
JO - Knowledge and Information Systems
JF - Knowledge and Information Systems
IS - 10
ER -