School of Engineering


Cluster-based Organization of Documents using Corpus-based Text Analysis

This study introduces an approach for cluster-based document organization using corpus-based text analysis. Specifically, our solution accepts as input a set of unstructured text documents. It extracts term frequency and document frequency features from the corpus and augments these features with term co-occurrence and relatedness scores produced from a distributional thesaurus built on the same (or a related) corpus. The augmented feature vectors are then processed through a hierarchical clustering process to identify groups of similar documents, which can serve as candidates for thematic organization and topic extraction at a later stage.
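The pipeline described above can be illustrated with a minimal sketch. It is not the study's implementation: the function names (tfidf_vectors, relatedness, augment, organize), the Jaccard-style document co-occurrence score standing in for the distributional thesaurus, the smoothing weight alpha, average-linkage merging, and the fixed number of clusters k are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def tokenize(text):
    # Naive whitespace tokenizer; a real system would do more (assumption).
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(tokenized):
    """Build one tf * idf vector per document over a shared, sorted vocabulary."""
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append([tf[t] * (math.log(n / df[t]) + 1.0) for t in vocab])
    return vocab, vectors


def relatedness(vocab, tokenized):
    """Stand-in for a distributional thesaurus (assumption): Jaccard overlap
    of the sets of documents in which two terms occur."""
    docs_of = {t: {i for i, d in enumerate(tokenized) if t in d} for t in vocab}
    return [[len(docs_of[a] & docs_of[b]) / len(docs_of[a] | docs_of[b])
             for b in vocab] for a in vocab]


def augment(vectors, rel, alpha=0.5):
    """Add to each term weight a share of the weights of its related terms."""
    size = len(rel)
    return [[v[i] + alpha * sum(rel[i][j] * v[j] for j in range(size) if j != i)
             for i in range(size)] for v in vectors]


def cosine(u, v):
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


def organize(docs, k, alpha=0.5):
    """Agglomerative (average-linkage) clustering of augmented vectors into k groups."""
    tokenized = [tokenize(d) for d in docs]
    vocab, vectors = tfidf_vectors(tokenized)
    vectors = augment(vectors, relatedness(vocab, tokenized), alpha)

    def avg_sim(c1, c2):
        return sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2) / (len(c1) * len(c2))

    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > k:
        # Merge the two clusters with the highest average pairwise similarity.
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return sorted(sorted(c) for c in clusters)


docs = [
    "cats purr and cats sleep",
    "dogs bark and dogs run",
    "kittens purr and sleep",
    "puppies bark and run",
]
print(organize(docs, k=2))  # lists of document indices, one list per cluster
```

In this toy corpus the augmentation step gives a document about kittens some weight on the term "cats", because both terms co-occur with "purr" and "sleep"; that is the kind of effect the thesaurus-based scores are meant to provide. Stopping at a fixed k is a simplification; a hierarchical method can also be cut at a similarity threshold.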


Copyright 1997–2021 Lebanese American University, Lebanon.