Anomaly detection in web documents using crisp and fuzzy-based cosine clustering methodology

Menahem Friedman, Mark Last, Yaniv Makover, Abraham Kandel

Research output: Contribution to journal › Article › peer-review

36 Scopus citations

Abstract

Cluster analysis is a primary tool for detecting anomalous behavior in real-world data such as web documents, medical records of patients, or other personal data. Most existing methods for document clustering are based on the classical vector-space model, which represents each document by a fixed-size vector of weighted key terms, often referred to as key phrases. Since vector representations of documents are frequently very sparse, inverted files are used to avoid the heavy computational overhead that can arise in large and diverse document collections such as pages downloaded from the World Wide Web. To reduce computation costs and space complexity, many popular methods for clustering web documents, including those using inverted files, usually assume a relatively small, predetermined number of clusters. We propose several new crisp and fuzzy approaches based on the cosine similarity principle for clustering documents that are represented by variable-size vectors of key phrases, without limiting the final number of clusters. Each entry in a vector consists of two fields: the first refers to a key phrase in the document, and the second denotes an importance weight associated with this key phrase within the particular document. Removing the restriction on the total number of clusters may moderately increase computing costs, but it improves the method's performance in classifying incoming vectors as normal or abnormal, based on their similarity to the existing clusters. All the procedures presented in this work are characterized by two features: (a) the number of clusters is not restricted to a relatively small, predetermined number, i.e., an arbitrary new incoming vector that is not similar to any of the existing cluster centers necessarily starts a new cluster, and (b) a vector that appears n times in the training set is counted as n distinct vectors rather than as a single vector. These features are the main reasons for the high-quality performance of the proposed algorithms. We describe the algorithms in detail and demonstrate them in a real-world application from the area of web activity monitoring: detecting anomalous documents downloaded from the internet by users with abnormal information interests.
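
The two features named in the abstract can be illustrated with a minimal Python sketch of a crisp variant. It is not the authors' algorithm: the abstract does not specify the similarity threshold, the cluster-center update rule, or the fuzzy variants, so the `threshold` parameter, the mean-vector center, and all function and class names here are assumptions made for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {key_phrase: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class Cluster:
    """Hypothetical cluster whose center is the mean of its member vectors."""

    def __init__(self, vector):
        self.total = dict(vector)  # running sum of member weights per key phrase
        self.count = 1             # number of member vectors, duplicates included

    def add(self, vector):
        for k, w in vector.items():
            self.total[k] = self.total.get(k, 0.0) + w
        self.count += 1

    @property
    def center(self):
        return {k: w / self.count for k, w in self.total.items()}


def train(vectors, threshold):
    """Single pass over the training set with no cap on the number of clusters."""
    clusters = []
    for v in vectors:  # a vector appearing n times is processed n times
        best, best_sim = None, -1.0
        for c in clusters:
            s = cosine(v, c.center)
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= threshold:
            best.add(v)
        else:
            # Not similar to any existing center: start a new cluster.
            clusters.append(Cluster(v))
    return clusters


def is_anomalous(vector, clusters, threshold):
    """Flag a vector as abnormal if it is not similar to any cluster center."""
    return all(cosine(vector, c.center) < threshold for c in clusters)
```

A call such as `train(docs, threshold=0.5)` on a list of `{key_phrase: weight}` dictionaries returns the clusters, and `is_anomalous(new_doc, clusters, 0.5)` then classifies an incoming vector. Storing each document as a dictionary mirrors the variable-size (key phrase, weight) representation and avoids materializing a sparse fixed-size vector.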

Original language: English
Pages (from-to): 467-475
Number of pages: 9
Journal: Information Sciences
Volume: 177
Issue number: 2
State: Published - 15 Jan 2007

Keywords

  • Anomaly detection
  • Cosine similarity
  • Document clustering
  • Fuzzy-based clustering
