Classification of web documents using a graph model

Adam Schenker, Mark Last, Horst Bunke, Abraham Kandel

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

71 Scopus citations

Abstract

In this paper we describe work relating to classification of web documents using a graph-based model instead of the traditional vector-based model for document representation. We compare the classification accuracy of the vector model approach using the k-Nearest Neighbor (k-NN) algorithm to a novel approach which allows the use of graphs for document representation in the k-NN algorithm. The proposed method is evaluated on three different web document collections using the leave-one-out approach for measuring classification accuracy. The results show that the graph-based k-NN approach can outperform traditional vector-based k-NN methods in terms of both accuracy and execution time.
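
The evaluation setup named in the abstract (k-NN over graph representations, scored by leave-one-out) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the graph construction (one node per distinct word, one directed edge per pair of adjacent words), the distance measure (one minus the fraction of shared nodes and edges relative to the larger graph), the function names, and the toy documents.

```python
from collections import Counter

def text_to_graph(text):
    # Assumed representation: nodes are distinct words, edges link adjacent words.
    words = text.lower().split()
    return frozenset(words), frozenset(zip(words, words[1:]))

def graph_distance(g1, g2):
    # Assumed distance: 1 - (shared nodes + shared edges) / size of the larger graph.
    size1 = len(g1[0]) + len(g1[1])
    size2 = len(g2[0]) + len(g2[1])
    largest = max(size1, size2)
    if largest == 0:
        return 0.0
    common = len(g1[0] & g2[0]) + len(g1[1] & g2[1])
    return 1.0 - common / largest

def knn_predict(query, training, k):
    # Take the k training graphs closest to the query and return the majority label.
    neighbors = sorted(training, key=lambda item: graph_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(dataset, k=1):
    # Classify each document using all the other documents as training data.
    correct = 0
    for i, (graph, label) in enumerate(dataset):
        rest = dataset[:i] + dataset[i + 1:]
        if knn_predict(graph, rest, k) == label:
            correct += 1
    return correct / len(dataset)

# Toy documents invented for this sketch.
docs = [
    ("stock market prices rise", "business"),
    ("stock market prices fall", "business"),
    ("football team wins match", "sports"),
    ("football team loses match", "sports"),
]
dataset = [(text_to_graph(text), label) for text, label in docs]
accuracy = leave_one_out_accuracy(dataset, k=1)
```

Because each node label is unique within a graph in this sketch, the shared part of two graphs reduces to plain set intersections, which keeps the distance computation cheap. A vector-based baseline would replace `text_to_graph` and `graph_distance` with term vectors and a vector distance while keeping the same leave-one-out loop.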

Original language: English
Title of host publication: Proceedings - 7th International Conference on Document Analysis and Recognition, ICDAR 2003
Publisher: IEEE Computer Society
Pages: 240-244
Number of pages: 5
ISBN (Electronic): 0769519601
State: Published - 1 Jan 2003
Event: 7th International Conference on Document Analysis and Recognition, ICDAR 2003 - Edinburgh, United Kingdom
Duration: 3 Aug 2003 → 6 Aug 2003

Publication series

Name: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
Volume: 2003-January
ISSN (Print): 1520-5363

Conference

Conference: 7th International Conference on Document Analysis and Recognition, ICDAR 2003
Country/Territory: United Kingdom
City: Edinburgh
Period: 3/08/03 → 6/08/03
