Cluster-based Organization of Documents using Corpus-based Text Analysis
This study introduces an approach to cluster-based document organization using corpus-based text analysis. Specifically, our solution accepts a set of unstructured text documents as input. It extracts term frequency and document frequency features from the corpus and augments them with term co-occurrence and relatedness scores produced from a distributional thesaurus built on the same (or a related) corpus. The augmented feature vectors are then passed through a hierarchical clustering process to identify groups of similar documents, which can serve as candidates for thematic organization and topic extraction at a later stage.
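The pipeline described above can be illustrated with a minimal sketch, which is not the study's actual implementation. Several choices here are assumptions made only for illustration: the smoothed TF-IDF weighting, the use of document-level co-occurrence as a simple stand-in for a distributional thesaurus, the mixing weight `alpha`, cosine similarity, average linkage, and a fixed number of clusters `k`. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Build one sparse {term: tf * idf} vector per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            t: (count / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, count in tf.items()
        })
    return vectors, df


def relatedness(docs, df):
    """Assumed stand-in for a thesaurus: normalized document co-occurrence."""
    cooc = Counter()
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            cooc[(a, b)] += 1
    related = {}
    for (a, b), count in cooc.items():
        score = count / math.sqrt(df[a] * df[b])
        related.setdefault(a, {})[b] = score
        related.setdefault(b, {})[a] = score
    return related


def augment(vector, related, alpha=0.5):
    """Spread a share (alpha, assumed) of each term's weight to related terms."""
    out = dict(vector)
    for term, weight in vector.items():
        for other, score in related.get(term, {}).items():
            out[other] = out.get(other, 0.0) + alpha * weight * score
    return out


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(vectors, k):
    """Average-linkage agglomerative clustering, stopped at k clusters."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Merge the pair of clusters with the highest average similarity.
        x, y = max(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merged = clusters.pop(y)  # y > x, so index x stays valid
        clusters[x].extend(merged)
    return clusters


# Toy corpus with two themes (made up for this sketch).
texts = [
    "cats purr while cats sleep",
    "kittens purr then sleep",
    "stocks rally as markets open",
    "markets drop when stocks fall",
]
docs = [text.lower().split() for text in texts]
vectors, df = tfidf_vectors(docs)
related = relatedness(docs, df)
augmented = [augment(v, related) for v in vectors]
groups = cluster(augmented, k=2)
print(sorted(sorted(g) for g in groups))  # [[0, 1], [2, 3]]
```

On this toy corpus the two animal documents and the two market documents end up in separate groups. In practice the relatedness scores would come from a thesaurus built on a much larger corpus, and the stopping criterion would more likely be a similarity threshold or a cut of the full dendrogram than a fixed `k`.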