LDA filter: A Latent Dirichlet Allocation preprocess method for Weka.

P Celard, A Seara Vieira, E L Iglesias, L Borrajo
Author Information
  1. P Celard: Computer Science Dept., Univ. of Vigo, Escuela Superior de Ingeniería Informática, Ourense, Spain.
  2. A Seara Vieira: Computer Science Dept., Univ. of Vigo, Escuela Superior de Ingeniería Informática, Ourense, Spain.
  3. E L Iglesias: Computer Science Dept., Univ. of Vigo, Escuela Superior de Ingeniería Informática, Ourense, Spain.
  4. L Borrajo: Computer Science Dept., Univ. of Vigo, Escuela Superior de Ingeniería Informática, Ourense, Spain.

Abstract

This work presents an alternative method for representing documents based on LDA (Latent Dirichlet Allocation) and examines how it affects classification algorithms in comparison with a common text representation. LDA assumes that each document covers a predefined set of topics, each of which is a distribution over the entire vocabulary. Our main objective is to use the probability of a document belonging to each topic to implement a new text representation model. The proposed technique is deployed as a new filter that extends the Weka software. To demonstrate its performance, the filter is tested with different classifiers, namely a Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), and Naive Bayes, on different document corpora (OHSUMED, Reuters-21578, 20Newsgroup, Yahoo! Answers, YELP Polarity, and TREC Genomics 2015). It is then compared with the Bag of Words (BoW) representation technique. Results suggest that applying the proposed filter achieves accuracy similar to that of BoW while greatly reducing classification processing time.
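
The representation described above can be illustrated with a minimal sketch. It is not the authors' Weka filter: it is a small pure-Python collapsed Gibbs sampler, chosen here only because it needs no external library, and the abstract does not state which inference method the filter uses. The function name `lda_topic_features`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy corpus are all assumptions made for illustration.

```python
import random


def lda_topic_features(docs, num_topics=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Hypothetical helper: fit LDA by collapsed Gibbs sampling and return
    one topic-probability vector per document, plus the vocabulary."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables: document-topic, topic-word, and per-topic totals.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    # Smoothed document-topic distribution: the feature vector for a classifier.
    features = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        features.append([(doc_topic[d][k] + alpha) / denom for k in range(num_topics)])
    return features, vocab


# Toy corpus (invented for illustration), already tokenised.
docs = [
    "gene protein cell gene dna".split(),
    "protein dna cell gene".split(),
    "stock market price trade".split(),
    "market trade price stock stock".split(),
]
features, vocab = lda_topic_features(docs, num_topics=2)
for vec in features:
    print([round(p, 3) for p in vec])
print(len(vocab), "BoW columns vs", len(features[0]), "topic columns")
```

Each document is reduced to `num_topics` values that sum to 1, whereas a BoW matrix needs one column per vocabulary term. That reduction in dimensionality is one plausible reason a downstream classifier would run faster on the topic representation.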

MeSH Term

Algorithms
Bayes Theorem
Cluster Analysis
Humans
Information Storage and Retrieval
Support Vector Machine
