Roman Urdu hate speech detection using hybrid machine learning models and hyperparameter optimization.

Waqar Ashiq, Samra Kanwal, Adnan Rafique, Muhammad Waqas, Tahir Khurshaid, Elizabeth Caro Montero, Alicia Bustamante Alonso, Imran Ashraf
Author Information
  1. Waqar Ashiq: Department of Software Engineering, University of Management and Technology, Lahore, 54590, Pakistan.
  2. Samra Kanwal: Department of Computer Science, University of Management and Technology, Lahore, 54590, Pakistan.
  3. Adnan Rafique: School of Information and Communications Technology, University of Tasmania, Launceston, 7250, Australia.
  4. Muhammad Waqas: Department of Mathematics, University of Education, Vehari, 61100, Pakistan.
  5. Tahir Khurshaid: Department of Electrical Engineering, Yeungnam University, Gyeongsan, 38541, Republic of Korea. tahir@ynu.ac.kr.
  6. Elizabeth Caro Montero: Universidad Europea del Atlantico, Isabel Torres 21, Santander, 39011, Spain.
  7. Alicia Bustamante Alonso: Universidad Europea del Atlantico, Isabel Torres 21, Santander, 39011, Spain.
  8. Imran Ashraf: Department of Information and Communication Engineering, Yeungnam University, Gyeongsan, 38541, Republic of Korea. imranashraf@ynu.ac.kr.

Abstract

With the rapid growth in the number of social media users, cyberbullying and hate speech have emerged as problems in recent years. Automatic hate speech detection (HSD) from text is an emerging research problem in natural language processing (NLP). Researchers have developed various approaches to automatic HSD using corpora in many languages; however, research on the Urdu language is scarce. This study addresses the HSD task on Twitter using Roman Urdu text. Its contribution is a hybrid model for Roman Urdu HSD, an approach that has not been previously explored. The hybrid model integrates deep learning (DL) and transformer models for automatic feature extraction with machine learning algorithms (MLAs) for classification. To further enhance model performance, we employ several hyperparameter optimization (HPO) techniques: Grid Search (GS), Randomized Search (RS), and Bayesian Optimization with Gaussian Processes (BOGP). Evaluation is carried out on two publicly available benchmark Roman Urdu corpora: the HS-RU-20 corpus and the RUHSOLD hate speech corpus. Results show that the Multilingual BERT (MBERT) feature learner, paired with a Support Vector Machine (SVM) classifier and optimized using RS, achieves state-of-the-art performance. On the HS-RU-20 corpus, this model attained an accuracy of 0.93 and an F1 score of 0.95 for the Neutral-Hostile classification task, and an accuracy of 0.89 and an F1 score of 0.88 for the Hate Speech-Offensive task. On the RUHSOLD corpus, the same model achieved an accuracy of 0.95 and an F1 score of 0.94 for the Coarse-grained task, and an accuracy of 0.87 and an F1 score of 0.84 for the Fine-grained task. These results demonstrate the effectiveness of our hybrid approach for Roman Urdu hate speech detection.
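The classification stage described above (an SVM tuned with Randomized Search on top of learned features) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn and SciPy, and it replaces MBERT sentence embeddings with synthetic placeholder vectors so that it runs without any corpus or model download. The function names, search ranges, vector dimension, and number of iterations are all hypothetical choices made for illustration.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_placeholder_embeddings(n_samples=300, dim=32, seed=0):
    """Hypothetical stand-in for embeddings produced by a feature learner.

    In the setting of the abstract these vectors would come from a model
    such as MBERT; here they are random vectors with a class-dependent shift.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_samples)
    features = rng.normal(size=(n_samples, dim)) + 0.8 * labels[:, None]
    return features, labels


def tune_svm_with_random_search(features, labels, n_iter=10, seed=0):
    """Sample n_iter SVM configurations at random and keep the best by CV."""
    pipeline = make_pipeline(StandardScaler(), SVC())
    # Assumed search space; the ranges are illustrative, not from the paper.
    param_distributions = {
        "svc__C": loguniform(1e-2, 1e2),
        "svc__gamma": loguniform(1e-4, 1e0),
        "svc__kernel": ["linear", "rbf"],
    }
    search = RandomizedSearchCV(
        pipeline,
        param_distributions,
        n_iter=n_iter,
        scoring="f1_macro",
        cv=3,
        random_state=seed,
    )
    search.fit(features, labels)
    return search


# Hold out a test split, tune on the training split, then report both metrics.
X, y = make_placeholder_embeddings()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
search = tune_svm_with_random_search(X_train, y_train)
predictions = search.predict(X_test)
print("best parameters:", search.best_params_)
print("accuracy:", accuracy_score(y_test, predictions))
print("F1 score:", f1_score(y_test, predictions))
```

Randomized Search evaluates a fixed number of sampled configurations rather than every point on a grid, so its cost is set by `n_iter` rather than by the size of the search space. Swapping in real embeddings only requires replacing `make_placeholder_embeddings` with a function that returns one feature vector per tweet.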

MeSH Term

Humans
Machine Learning
Speech
Natural Language Processing
Algorithms
Hate
Deep Learning
Bayes Theorem
Language
