Selecting a representative decision tree from an ensemble of decision-tree models for fast big data classification

Abraham Itzhak Weinberg, Mark Last

Research output: Contribution to journal › Article › peer-review

34 Scopus citations

Abstract

The goal of this paper is to reduce the classification (inference) complexity of tree ensembles by choosing a single representative model out of an ensemble of multiple decision-tree models. We compute the similarity between the different models in the ensemble and choose the model that is most similar to the others as the best representative of the entire dataset. The similarity-based approach is implemented with three different similarity metrics: a syntactic one, a semantic one, and a linear combination of the two. We compare this tree selection methodology to a popular ensemble algorithm (majority voting) and to the baseline of randomly choosing one of the local models. In addition, we evaluate two alternative tree selection strategies: choosing the tree with the highest validation accuracy and reducing the original ensemble to the five most representative trees. The comparative evaluation experiments are performed on six big datasets using two popular decision-tree algorithms (J48 and CART), with each dataset split horizontally into six different numbers of equal-size slices (from 32 to 1024). In most experiments, the syntactic similarity approach, named SySM (Syntactic Similarity Method), provides a significantly higher testing accuracy than the semantic and the combined approaches. The mean accuracy of SySM over all datasets is 0.835 ± 0.065 for CART and 0.769 ± 0.066 for J48. On the other hand, we find no statistically significant difference between the testing accuracy of the trees selected by SySM and that of the trees with the highest validation accuracy. Compared to ensemble algorithms, the representative models selected by the proposed methods classify big data faster and are also more compact and interpretable.
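The selection step described in the abstract can be illustrated with a minimal Python sketch. It assumes that pairwise similarities between the trees have already been computed and stored in a square matrix; how the syntactic and semantic similarities themselves are computed is not covered here. The function names, the weight `alpha`, and the use of the mean similarity to the other trees as the selection score are illustrative assumptions, not details taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def combined_similarity(syntactic: Matrix, semantic: Matrix, alpha: float = 0.5) -> Matrix:
    """Linear combination of two pairwise similarity matrices.

    `alpha` is a hypothetical weight: 1.0 uses only the syntactic
    similarity, 0.0 uses only the semantic similarity.
    """
    n = len(syntactic)
    return [
        [alpha * syntactic[i][j] + (1.0 - alpha) * semantic[i][j] for j in range(n)]
        for i in range(n)
    ]


def select_representative(similarity: Matrix) -> int:
    """Return the index of the tree with the highest mean similarity to all other trees."""
    n = len(similarity)
    if n == 0:
        raise ValueError("the ensemble is empty")
    if n == 1:
        return 0
    best_index, best_score = 0, float("-inf")
    for i in range(n):
        # Skip the diagonal: a tree's similarity to itself carries no information.
        score = sum(similarity[i][j] for j in range(n) if j != i) / (n - 1)
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Toy example with three trees (the similarity values are made up).
toy_similarity = [
    [1.0, 0.8, 0.6],
    [0.8, 1.0, 0.7],
    [0.6, 0.7, 1.0],
]
representative = select_representative(toy_similarity)  # index 1
```

In the toy matrix, the second tree has the highest mean similarity to the other two (0.75, against 0.70 and 0.65), so it is selected. At inference time only that one tree is evaluated, which is the source of the speed advantage over majority voting across the whole ensemble.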

Original language: English
Article number: 23
Journal: Journal of Big Data
Volume: 6
Issue number: 1
State: Published - 1 Dec 2019

Keywords

  • Big data
  • Decision trees
  • Editing distance
  • Ensemble learning
  • Lazy ensemble evaluation
  • Tree similarity

ASJC Scopus subject areas

  • Information Systems
  • Hardware and Architecture
  • Computer Networks and Communications
  • Information Systems and Management
