Topic selection for text classification using ensemble topic modeling with grouping, scoring, and modeling approach.

Daniel Voskergian, Rashid Jayousi, Malik Yousef
Author Information
  1. Daniel Voskergian: Computer Engineering Department, Al-Quds University, Jerusalem, Palestine. daniel2vosk@gmail.com.
  2. Rashid Jayousi: Computer Science Department, Al-Quds University, Jerusalem, Palestine.
  3. Malik Yousef: Department of Information Systems, Zefat Academic College, Zefat, Israel. malik.yousef@gmail.com.

Abstract

TextNetTopics (Yousef et al. in Front Genet 13:893378, 2022. https://doi.org/10.3389/fgene.2022.893378) is a recently developed approach that performs text classification using topics (a topic is a group of terms or words) extracted by Latent Dirichlet Allocation topic modeling as features, rather than individual words. This enables TextNetTopics to reduce dimensionality while preserving and embedding more thematic and semantic information in the text document representations. In this article, we introduce a novel approach, the Ensemble Topic Model for Topic Selection (ENTM-TS), an advancement of TextNetTopics. ENTM-TS integrates multiple topic models using the Grouping, Scoring, and Modeling approach, thereby mitigating the performance variability introduced by relying on any individual topic modeling method within TextNetTopics. Additionally, we performed a thorough comparative study to evaluate TextNetTopics' performance using eleven state-of-the-art topic modeling algorithms. We used the topics extracted by each algorithm as input to the G component of the TextNetTopics tool to identify the most effective topic model in terms of predictive performance for text classification. We conducted our evaluation on the Drug-Induced Liver Injury textual dataset from the CAMDA community and the WOS-5736 dataset. The experimental results show that Latent Semantic Indexing provides comparable performance with fewer discriminative features than the other topic modeling methods. Moreover, our evaluation reveals that the performance of ENTM-TS surpasses or matches the best results obtained from individual topic models across the two datasets, establishing it as a robust and effective enhancement for text classification tasks.
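The idea of pooling topics from several topic models and then grouping, scoring, and selecting them can be illustrated with a minimal sketch. This is not the authors' implementation: the topics, the toy corpus, and the helper names `group_counts` and `score_topic` are hypothetical, and the scoring rule (the gap between class-wise mean counts) is an assumed stand-in for whatever scoring the actual tool uses.

```python
from collections import Counter

# Hypothetical topics (word groups), as if produced by two different topic models.
topics_by_model = {
    "lda": [["liver", "injury", "toxicity"], ["gene", "expression"]],
    "lsi": [["liver", "hepatic", "drug"], ["network", "protocol"]],
}

# Toy labelled corpus (assumed): 1 = relevant to liver injury, 0 = not relevant.
docs = [
    ("drug induced liver injury and hepatic toxicity", 1),
    ("liver toxicity after drug exposure", 1),
    ("gene expression in network protocol design", 0),
    ("routing protocol for a sensor network", 0),
]

def group_counts(text, topic):
    """Grouping: total count, in one document, of the words belonging to one topic."""
    counts = Counter(text.split())
    return sum(counts[w] for w in topic)

def score_topic(topic, labelled_docs):
    """Scoring stand-in (assumption): absolute gap between class-wise mean counts."""
    pos = [group_counts(t, topic) for t, y in labelled_docs if y == 1]
    neg = [group_counts(t, topic) for t, y in labelled_docs if y == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

# Ensemble: pool the topics of all models, then rank them by score (best first).
pooled = [(name, tuple(topic)) for name, ts in topics_by_model.items() for topic in ts]
ranked = sorted(pooled, key=lambda item: score_topic(item[1], docs), reverse=True)

# Modeling stand-in: represent each document by its counts on the top-k topics.
top_k = [topic for _, topic in ranked[:2]]
features = [[group_counts(text, topic) for topic in top_k] for text, _ in docs]
```

In this sketch each document ends up with two topic-level features instead of one feature per vocabulary word, which is the dimensionality reduction the abstract refers to. A real pipeline would obtain the topics from fitted topic models and score them with a trained classifier.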

References

  1. F1000Res. 2020 Oct 19;9:1255 [PMID: 33500779]
  2. Nature. 1999 Oct 21;401(6755):788-91 [PMID: 10548103]
  3. Front Genet. 2023 Oct 05;14:1243874 [PMID: 37867598]
  4. Entropy (Basel). 2020 Dec 22;23(1): [PMID: 33374969]
  5. BMC Bioinformatics. 2007 May 02;8:144 [PMID: 17474999]
  6. PeerJ. 2023 Jul 17;11:e15666 [PMID: 37483989]
  7. PeerJ Comput Sci. 2021 Feb 22;7:e336 [PMID: 33816987]
  8. Bioinformatics. 2019 Oct 15;35(20):4020-4028 [PMID: 30895309]
  9. PeerJ. 2021 May 19;9:e11458 [PMID: 34055490]
  10. BMC Bioinformatics. 2009 Oct 15;10:337 [PMID: 19832995]
  11. Front Genet. 2022 Apr 12;13:767455 [PMID: 35495139]
  12. Front Genet. 2023 Aug 21;14:1139082 [PMID: 37671046]
  13. Front Genet. 2022 Jun 20;13:893378 [PMID: 35795215]
  14. Heliyon. 2023 Nov 22;9(12):e22666 [PMID: 38090011]
  15. Front Big Data. 2022 May 04;5:846930 [PMID: 35600326]
  16. PLoS One. 2014 Jan 09;9(1):e82119 [PMID: 24416136]
  17. BMC Bioinformatics. 2023 Feb 23;24(1):60 [PMID: 36823571]
  18. Sci Rep. 2022 Nov 19;12(1):19955 [PMID: 36402891]
  19. Front Genet. 2023 Jan 12;13:1076554 [PMID: 36712859]
  20. Front Genet. 2023 Mar 15;14:1093326 [PMID: 37007972]
  21. Syst Rev. 2015 Nov 26;4:172 [PMID: 26612232]
