Roman Urdu Hate Speech Detection Using Transformer-Based Model for Cyber Security Applications

Muhammad Bilal, Atif Khan, Salman Jan, Shahrulniza Musa, Shaukat Ali
Author Information
  1. Muhammad Bilal: Department of Computer Science, Islamia College Peshawar, Peshawar 25130, Pakistan.
  2. Atif Khan: Department of Computer Science, Islamia College Peshawar, Peshawar 25130, Pakistan.
  3. Salman Jan: Malaysian Institute of Information Technology, Universiti Kuala Lumpur, Kuala Lumpur 50250, Malaysia.
  4. Shahrulniza Musa: Malaysian Institute of Information Technology, Universiti Kuala Lumpur, Kuala Lumpur 50250, Malaysia.
  5. Shaukat Ali: Department of Computer Science, Islamia College Peshawar, Peshawar 25130, Pakistan.

Abstract

Social media applications, such as Twitter and Facebook, allow users to communicate and share their thoughts, status updates, opinions, photographs, and videos around the globe. Unfortunately, some people use these platforms to disseminate hate speech and abusive language. The growth of hate speech may result in hate crimes, cyber violence, and substantial harm to cyberspace, physical security, and social safety. Hate speech detection is therefore a critical issue for both cyberspace and physical society, necessitating robust applications capable of detecting and combating it in real time. Hate speech detection is a context-dependent problem that requires context-aware mechanisms. In this study, we employed a transformer-based model for Roman Urdu hate speech classification because of its ability to capture textual context. In addition, we developed the first Roman Urdu pre-trained BERT model, which we named BERT-RU. For this purpose, we exploited the capabilities of BERT by training it from scratch on the largest Roman Urdu dataset, consisting of 173,714 text messages. Traditional machine learning and deep learning models were used as baselines, including LSTM, BiLSTM, BiLSTM with an attention layer, and CNN. We also investigated transfer learning by using pre-trained BERT embeddings in conjunction with deep learning models. The performance of each model was evaluated in terms of accuracy, precision, recall, and F-measure, and the generalization of each model was assessed on a cross-domain dataset. The experimental results revealed that the transformer-based model, when applied directly to the Roman Urdu hate speech classification task, outperformed the traditional machine learning models, deep learning models, and pre-trained transformer-based models in terms of accuracy, precision, recall, and F-measure, with scores of 96.70%, 97.25%, 96.74%, and 97.89%, respectively. In addition, the transformer-based model exhibited superior generalization on the cross-domain dataset.
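The evaluation measures reported above (accuracy, precision, recall, F-measure) follow from the binary confusion matrix of a hate/neutral classifier. A minimal sketch in plain Python, using hypothetical toy labels rather than the paper's data:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = hate speech)."""
    # Confusion-matrix counts for the positive (hate) class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Hypothetical labels for illustration only (1 = hate speech, 0 = neutral).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
acc, prec, rec, f1 = binary_metrics(y_true, y_pred)
```

Note that, as the harmonic mean of precision and recall, the F-measure always lies between the two; in practice these metrics are typically computed with a library such as scikit-learn rather than by hand.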

MeSH Terms

Humans
Hate
Speech
Awareness
Computer Security
Language
