Classification of web documents using concept extraction from ontologies

Marina Litvak, Mark Last, Slava Kisilevich

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Scopus citations

Abstract

In this paper, we address the problem of analyzing and classifying web documents in a given domain by information filtering agents. We present an ontology-based web content mining methodology with the following main stages: creating an ontology for the specified domain, collecting a training set of labeled documents, building a classification model for the domain using the constructed ontology and a classification algorithm, and classifying new documents by information agents via the induced model. We evaluated the proposed methodology in two domains: the chemical domain (web pages containing information about the production of certain chemicals) and a Yahoo! collection of web news documents divided into several categories. Our system receives as input a domain-specific ontology and a set of categorized web documents, and then performs concept generalization on these documents. We use a key-phrase extractor with an integrated ontology parser to create a database from the input documents, which serves as the training set for the classification algorithm. The classification accuracy of the system is estimated using various levels of the ontology.
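The concept generalization step mentioned in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy ontology in `PARENT`, the helper names `path_from_root`, `generalize`, and `concept_features`, and the use of simple concept counts as features. The abstract does not specify the ontology format, the key-phrase extractor, or the classification algorithm.

```python
from collections import Counter

# Hypothetical toy ontology: concept -> parent concept (the root has parent None).
PARENT = {
    "entity": None,
    "chemical": "entity",
    "acid": "chemical",
    "sulfuric acid": "acid",
    "nitric acid": "acid",
    "solvent": "chemical",
    "acetone": "solvent",
}

def path_from_root(concept):
    """Return the list of concepts from the root down to `concept`."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path[::-1]

def generalize(concept, level):
    """Replace a concept by its ancestor at depth `level` (root = 0).

    Concepts that sit above `level` in the ontology are left unchanged.
    """
    path = path_from_root(concept)
    return path[min(level, len(path) - 1)]

def concept_features(key_phrases, level):
    """Count generalized concepts for the key phrases found in the ontology."""
    return Counter(generalize(p, level) for p in key_phrases if p in PARENT)

# Key phrases as they might come out of an extractor (assumed input).
phrases = ["sulfuric acid", "nitric acid", "acetone", "price"]
features = concept_features(phrases, level=2)
```

Lowering `level` merges more phrases into the same concept, while raising it keeps them distinct. Feature vectors built this way at different levels could then be passed to any classifier in order to compare accuracy across ontology levels.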

Original language: English
Title of host publication: Autonomous Intelligent Systems
Subtitle of host publication: Agents and Data Mining - Second International Workshop, AIS-ADM 2007, Proceedings
Publisher: Springer Verlag
Pages: 287-292
Number of pages: 6
ISBN (Print): 9783540728382
State: Published - 1 Jan 2007
Event: 2nd International Workshop Autonomous Intelligent Systems: Agents and Data Mining, AIS-ADM 2007 - St. Petersburg, Russian Federation
Duration: 3 Jun 2007 to 5 Jun 2007

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4476 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 2nd International Workshop Autonomous Intelligent Systems: Agents and Data Mining, AIS-ADM 2007
Country/Territory: Russian Federation
City: St. Petersburg
Period: 3/06/07 to 5/06/07

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
