TY - GEN
T1 - Fast categorization of web documents represented by graphs
AU - Markov, A.
AU - Last, M.
AU - Kandel, A.
PY - 2007/1/1
Y1 - 2007/1/1
AB - Most text categorization methods are based on the vector-space model of information retrieval. One of the important advantages of this representation model is that it can be used by both instance-based and model-based classifiers for categorization. However, this popular method of document representation does not capture important structural information, such as the order and proximity of word occurrence or the location of a word within the document. It also makes no use of the mark-up information that is available from web document HTML tags. A recently developed graph-based representation of web documents can preserve the structural information. The new document model was shown to outperform the traditional vector representation, using the k-Nearest Neighbor (k-NN) classification algorithm. The problem, however, is that the eager (model-based) classifiers cannot work with this representation directly. In this chapter, three new, hybrid approaches to web document categorization are presented, built upon both graph and vector space representations, thus preserving the benefits and overcoming the limitations of each. The hybrid methods presented here are compared to vector-based models using two model-based classifiers (C4.5 decision-tree algorithm and probabilistic Naïve Bayes) and several benchmark web document collections. The results demonstrate that the hybrid methods outperform, in most cases, existing approaches in terms of classification accuracy, and in addition, achieve a significant increase in the categorization speed.
UR - http://www.scopus.com/inward/record.url?scp=38549124327&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-77485-3_4
DO - 10.1007/978-3-540-77485-3_4
M3 - Conference contribution
AN - SCOPUS:38549124327
SN - 354077484X
SN - 9783540774846
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 56
EP - 71
BT - Advances in Web Mining and Web Usage Analysis - 8th International Workshop on Knowledge Discovery on the Web, WebKDD 2006, Revised Papers
PB - Springer Verlag
T2 - 8th International Workshop on Knowledge Discovery on the Web, WebKDD 2006
Y2 - 20 August 2006
ER -