Unsupervised document classification integrating web scraping, one-class SVM and LDA topic modelling.

Anton Thielmann, Christoph Weisser, Astrid Krenz, Benjamin Säfken
Author Information
  1. Anton Thielmann: Center for Statistics, Georg-August-Universität Göttingen, Göttingen, Germany.
  2. Christoph Weisser: Center for Statistics, Georg-August-Universität Göttingen, Göttingen, Germany.
  3. Astrid Krenz: Center for Statistics, Georg-August-Universität Göttingen, Göttingen, Germany.
  4. Benjamin Säfken: Center for Statistics, Georg-August-Universität Göttingen, Göttingen, Germany.

Abstract

Unsupervised document classification for imbalanced data sets poses a major challenge. To obtain accurate classification results, training data sets are often created manually by humans, which requires expert knowledge, time and money. Depending on the imbalance of the data set, this approach either requires human labelling of all of the data or fails to adequately recognize underrepresented categories. We propose an integration of web scraping, one-class Support Vector Machines (SVM) and Latent Dirichlet Allocation (LDA) topic modelling as a multi-step classification rule that circumvents manual labelling. The method achieves unsupervised one-class document classification by integrating out-of-domain training data, and more than 80% of the target data is correctly classified. The proposed method thus even outperforms common machine learning classifiers and is validated on multiple data sets.
