Event classification from the Urdu language text on social media.

Malik Daler Ali Awan, Nadeem Iqbal Kajla, Amnah Firdous, Mujtaba Husnain, Malik Muhammad Saad Missen
Author Information
  1. Malik Daler Ali Awan: Department of Software Engineering, Faculty of Computing, The Islamia University of Bahawalpur, Bahawalpur, Punjab, Pakistan.
  2. Nadeem Iqbal Kajla: Department of Computer Science, Muhammad Nawaz Sharif University of Agriculture, Multan, Multan, Punjab, Pakistan.
  3. Amnah Firdous: Computer Science and Information Technology, The Govt. Sadiq College and Women University Bahawalpur, Bahawalpur, Punjab, Pakistan.
  4. Mujtaba Husnain: Department of Information Technology, Faculty of Computing, The Islamia University of Bahawalpur, Bahawalpur, Punjab, Pakistan.
  5. Malik Muhammad Saad Missen: Department of Information Technology, Faculty of Computing, The Islamia University of Bahawalpur, Bahawalpur, Punjab, Pakistan.

Abstract

The real-time availability of the Internet has engaged millions of users around the world. Regional languages are increasingly preferred for effective and easy communication, which produces multilingual data on social networks and news channels. People share ideas, opinions, and events happening globally (e.g., sports, inflation, protest, explosion, and sexual assault) in regional (local) languages on social media. Extracting and classifying events from multilingual data has become a bottleneck because of a lack of resources. In this research paper, we present an event classification task for Urdu-language text from social media and news channels using machine learning classifiers. The dataset contains more than 0.1 million (102,962) labeled instances of twelve (12) different types of events. The title, its length, and the last four words of a sentence are used as features to classify the events. Term Frequency-Inverse Document Frequency (TF-IDF) showed the best results as a feature vector for evaluating the performance of six popular machine learning classifiers. Random Forest (RF) and K-Nearest Neighbor (KNN) outperformed the other classifiers, achieving 98.00% and 99.00% accuracy, respectively. The novelty lies in the fact that, to the best of our knowledge, the aforementioned features have not previously been applied to event extraction from text written in the Urdu language.
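To make the pipeline described in the abstract concrete, the following is a minimal standard-library sketch of one possible reading of it: the title, its length, and its last four words are turned into tokens, weighted with TF-IDF, and classified with a K-Nearest Neighbor vote. It is not the authors' implementation. The `END_` and `LEN_` token prefixes, the smoothed IDF formula, cosine similarity as the neighbor metric, all function names, and the English toy titles (stand-ins for Urdu text) are assumptions made for illustration only.

```python
import math
from collections import Counter


def feature_tokens(title):
    """Tokens for one title: its words, its last four words, and its length.

    The END_/LEN_ prefixes are an illustrative assumption; the abstract does
    not say how the three features are combined.
    """
    words = title.split()
    return words + ["END_" + w for w in words[-4:]] + ["LEN_" + str(len(words))]


def fit_idf(docs):
    """Smoothed inverse document frequency for every token seen in training."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def tfidf(doc, idf):
    """L2-normalised sparse TF-IDF vector; tokens unseen in training are dropped."""
    tf = Counter(tok for tok in doc if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def knn_predict(query, vectors, labels, k=3):
    """Majority label among the k training vectors most similar to the query."""
    ranked = sorted(zip(vectors, labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented toy data: English stand-ins for labeled Urdu titles.
TRAIN = [
    ("team wins cricket match final", "sports"),
    ("cricket team loses match today", "sports"),
    ("blast injures people in market", "explosion"),
    ("bomb blast reported near market", "explosion"),
    ("prices rise as inflation grows", "inflation"),
    ("inflation pushes food prices higher", "inflation"),
]
DOCS = [feature_tokens(title) for title, _ in TRAIN]
LABELS = [label for _, label in TRAIN]
IDF = fit_idf(DOCS)
VECTORS = [tfidf(doc, IDF) for doc in DOCS]


def classify(title, k=3):
    """Predict an event label for a new title using the toy training set."""
    return knn_predict(tfidf(feature_tokens(title), IDF), VECTORS, LABELS, k)


if __name__ == "__main__":
    print(classify("cricket match starts today"))
```

Whitespace splitting is used here because it also applies to space-delimited Urdu text, but a real system would likely need a proper Urdu tokenizer and a far larger training set than this toy sample.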
