Text data augmentation and pre-trained Language Model for enhancing text classification of low-resource languages.

Atabay Ziyaden, Amir Yelenov, Fuad Hajiyev, Samir Rustamov, Alexandr Pak
Author Information
  1. Atabay Ziyaden: Kazakh-British Technical University, Almaty, Kazakhstan.
  2. Amir Yelenov: Institute of Information and Computational Technologies, Almaty, Kazakhstan.
  3. Fuad Hajiyev: School of Information Technologies and Engineering, ADA University, Baku, Azerbaijan.
  4. Samir Rustamov: School of Information Technologies and Engineering, ADA University, Baku, Azerbaijan.
  5. Alexandr Pak: Kazakh-British Technical University, Almaty, Kazakhstan.

Abstract

Background: In the domain of natural language processing (NLP), the development and success of advanced language models are predominantly anchored in the richness of available linguistic resources. Languages such as Azerbaijani, which is classified as low-resource, often face challenges arising from limited labeled datasets, consequently hindering effective model training.
Methodology: The primary objective of this study was to enhance the effectiveness and generalization capabilities of news text classification models through text augmentation. To address the scarcity of labeled data in low-resource languages, we augment the training set with translations produced by the Facebook mBart50 model, by the Google Translate API, and by a combination of the two, thereby expanding the text available for training.
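
A minimal sketch of the translation-based augmentation described above, assuming the Hugging Face transformers checkpoint facebook/mbart-large-50-many-to-many-mmt (the abstract names only "mBart50") and English as an assumed pivot language; the Google Translate branch would follow the same round-trip pattern through its own API.

    from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

    # Assumed checkpoint; the abstract names only "mBart50".
    CKPT = "facebook/mbart-large-50-many-to-many-mmt"
    model = MBartForConditionalGeneration.from_pretrained(CKPT)
    tokenizer = MBart50TokenizerFast.from_pretrained(CKPT)

    def translate(text, src, tgt):
        # Translate one text from language code src to tgt.
        tokenizer.src_lang = src
        inputs = tokenizer(text, return_tensors="pt")
        ids = model.generate(
            **inputs, forced_bos_token_id=tokenizer.lang_code_to_id[tgt]
        )
        return tokenizer.batch_decode(ids, skip_special_tokens=True)[0]

    def back_translate(text, src="az_AZ", pivot="en_XX"):
        # Round trip src -> pivot -> src; the resulting paraphrase is added
        # to the training set as an augmented example.
        return translate(translate(text, src, pivot), pivot, src)
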
Results: The experimental outcomes reveal a promising improvement in classification performance when models are trained on the augmented dataset compared with counterparts trained on the original data. This investigation underscores the potential of combined data augmentation strategies to bolster the NLP capabilities of underrepresented languages. As a result of our research, we have published our labeled text classification dataset and a pre-trained RoBERTa model for the Azerbaijani language.
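
In the same hedged spirit, a sketch of loading the published classifier with transformers; the checkpoint identifier below is a placeholder, since the abstract does not state where the dataset and model are hosted or what the label set is.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Placeholder id: the published model's name is not given in the abstract.
    CKPT = "<published-azerbaijani-roberta>"
    tokenizer = AutoTokenizer.from_pretrained(CKPT)
    model = AutoModelForSequenceClassification.from_pretrained(CKPT)

    def classify(text):
        # Return the index of the highest-scoring news category.
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        return int(logits.argmax(dim=-1))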

Keywords

  Deep learning; Low-resource language; Machine learning; Natural language processing