A graph-based framework for web document mining

Adam Schenker, Horst Bunke, Mark Last, Abraham Kandel

Research output: Chapter in Book/Report/Conference proceeding › Chapter › peer-review

8 Scopus citations

Abstract

In this paper we describe methods of performing data mining on web documents, where the web document content is represented by graphs. We show how traditional clustering and classification methods, which usually operate on vector representations of data, can be extended to work with graph-based data. Specifically, we give graph-theoretic extensions of the k-Nearest Neighbors classification algorithm and the k-means clustering algorithm that process graphs, and show how retaining structural information can lead to improved performance over the vector model approach. We introduce several different types of web document representations that utilize graphs and compare their performance for clustering and classification.
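As a rough illustration of the graph-based k-Nearest Neighbors idea, the following minimal Python sketch classifies documents represented as graphs. Everything in it is an assumption made for illustration, not a detail given in the abstract: the graph representation (one node per unique term, a directed edge between adjacent terms), the distance (one minus the size of the shared subgraph divided by the size of the larger graph), and the helper names `doc_to_graph`, `graph_distance`, and `knn_classify`.

```python
from collections import Counter


def doc_to_graph(text):
    """Hypothetical representation: one node per unique term and a
    directed edge for each pair of adjacent terms."""
    terms = text.lower().split()
    nodes = frozenset(terms)
    edges = frozenset(zip(terms, terms[1:]))
    return nodes, edges


def graph_size(graph):
    """Size of a graph, counted here as nodes plus edges (an assumption)."""
    nodes, edges = graph
    return len(nodes) + len(edges)


def graph_distance(g1, g2):
    """Assumed distance: 1 - |shared subgraph| / max(|g1|, |g2|).

    Because node labels are unique, the shared subgraph is simply the
    intersection of the node sets and of the edge sets.
    """
    common = len(g1[0] & g2[0]) + len(g1[1] & g2[1])
    largest = max(graph_size(g1), graph_size(g2))
    return 1.0 - common / largest if largest else 0.0


def knn_classify(query, training, k=3):
    """Majority vote among the k training graphs closest to the query.

    `training` is a list of (graph, label) pairs.
    """
    nearest = sorted(training, key=lambda item: graph_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Unlike a bag-of-words vector, this representation keeps term adjacency, so two documents that share the same words in a different order are not treated as identical. A graph-based k-means would additionally need a notion of a cluster representative for graphs, which this sketch does not attempt.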

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Editors: Simone Marinai, Andreas Dengel
Publisher: Springer Verlag
Pages: 401-412
Number of pages: 12
ISBN (Print): 3540230602
State: Published - 1 Jan 2004

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 3163
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
