Multi-domain Urdu fake news detection using pre-trained ensemble model.

Sheetal Harris, Hassan Jalil Hadi, Naveed Ahmad, Mohammed Ali Alshara
Author Information
  1. Sheetal Harris: School of Cyber Science and Engineering, Wuhan University, Wuhan, China.
  2. Hassan Jalil Hadi: School of Cyber Science and Engineering, Wuhan University, Wuhan, China. hjhwhu@whu.edu.cn.
  3. Naveed Ahmad: Prince Sultan University, Riyadh, Saudi Arabia.
  4. Mohammed Ali Alshara: Prince Sultan University, Riyadh, Saudi Arabia.

Abstract

The dissemination of Fake News (FN) on websites and online platforms influences human behaviours, socio-political domains, and the sovereignty of a country. The outpouring of biased news and propaganda on online portals can be addressed by restricting it through an automated mechanism. Verifying the authenticity of news and information on online platforms is challenging in regional languages such as Urdu, which have limited resources and datasets. Furthermore, limited research in resource-constrained languages has created a language bias in Artificial Intelligence (AI) research, which this study focuses on. Natural Language Processing (NLP) techniques have been used for Fake News Detection (FND) in English news and for various language-related tasks. Previous studies used Machine Learning (ML), Deep Learning (DL), and individual Pre-trained Language Models (PLMs) for Urdu FND, and an ML-based ensemble model outperformed the pre-trained models. We propose a methodology for Urdu FND that applies stacked ensemble learning to three PLMs, ELECTRA, mBERT and XLM-RoBERTa, after appropriate fine-tuning and hyperparameter optimization. To overcome the limitations of each pre-trained transformer model, the models are fine-tuned individually on a publicly available Urdu dataset. The proposed stacking approach surpasses the prediction performance of each individual pre-trained model. An Accuracy of 0.914, a Matthews Correlation Coefficient (MCC) of 0.898, and an F1-score of 0.904 validate the efficacy of the proposed ensemble model.
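
The stacking step and the three reported metrics can be illustrated with a minimal sketch. It is not the authors' implementation: the base-model probabilities below are synthetic stand-ins for the outputs of fine-tuned ELECTRA, mBERT and XLM-RoBERTa, the logistic-regression meta-learner is an assumption (the abstract does not name the meta-learner), and every function name is hypothetical.

```python
import math
import random


def train_meta_learner(features, labels, epochs=300, lr=1.0):
    """Fit a logistic-regression meta-learner by batch gradient descent."""
    n_samples, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            grad_b += err
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict(weights, bias, features):
    """Label a sample 1 (fake) when the meta-learner's logit is positive."""
    return [int(bias + sum(w * xi for w, xi in zip(weights, x)) > 0) for x in features]


def binary_metrics(y_true, y_pred):
    """Accuracy, F1-score and Matthews Correlation Coefficient from confusion counts."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }


def simulate_base_probs(label, rng, n_models=3):
    """Synthetic stand-in for the P(fake) outputs of three fine-tuned PLMs."""
    return [min(1.0, max(0.0, 0.5 + 0.4 * (label - 0.5) + rng.gauss(0, 0.25)))
            for _ in range(n_models)]


# Synthetic data only: these are not results from the Urdu dataset.
rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(500)]
features = [simulate_base_probs(y, rng) for y in labels]
train_x, train_y = features[:300], labels[:300]
test_x, test_y = features[300:], labels[300:]

weights, bias = train_meta_learner(train_x, train_y)
stacked_scores = binary_metrics(test_y, predict(weights, bias, test_x))
single_scores = binary_metrics(test_y, [int(x[0] > 0.5) for x in test_x])
print("stacked:", stacked_scores)
print("single base model:", single_scores)
```

In a real stacking pipeline the meta-learner would normally be trained on out-of-fold predictions of the base models, so that it does not see probabilities produced on the base models' own training examples.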

MeSH Terms

Humans
Natural Language Processing
Machine Learning
Deception
Artificial Intelligence
Language
Internet
Information Dissemination
