RuSentiTweet: a sentiment analysis dataset of general domain tweets in Russian.

Sergey Smetanin
Author Information
  1. Sergey Smetanin: Department of Business Informatics, Graduate School of Business, National Research University Higher School of Economics, Russia.

Abstract

Russian is still not as well-resourced as English, especially for sentiment analysis of Twitter content. Although several sentiment analysis datasets of tweets in Russian exist, all of them are either automatically annotated or manually annotated by a single annotator. As a result, inter-annotator agreement cannot be reported for them, or the annotation may be focused on a specific domain. In this article, we present RuSentiTweet, a new sentiment analysis dataset of general domain tweets in Russian. RuSentiTweet is currently the largest in its class for Russian, with 13,392 tweets manually annotated with moderate inter-rater agreement into five classes: Positive, Neutral, Negative, Speech Act, and Skip. As a source of data, we used Twitter Stream Grab, a historical collection of tweets obtained from the general Twitter API stream, which provides a 1% sample of public tweets. Additionally, we released a RuBERT-based sentiment classification model that achieved F1 = 0.6594 on the test subset.
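To make the notion of inter-rater agreement over the five classes concrete, the following is a minimal sketch of how such a score could be computed. It uses Fleiss' kappa, which is an assumption: the abstract does not name the agreement statistic. The function name `fleiss_kappa`, the choice of three raters per tweet, and the toy labels are likewise hypothetical and are not taken from the dataset.

```python
from collections import Counter

# The five classes named in the abstract.
LABELS = ["Positive", "Neutral", "Negative", "Speech Act", "Skip"]


def fleiss_kappa(annotations, labels=LABELS):
    """Fleiss' kappa for a list of items, each a list of rater labels.

    Every item must carry the same number of labels (one per rater).
    Using this statistic is an assumption, not the paper's stated method.
    """
    n_items = len(annotations)
    n_raters = len(annotations[0])
    if n_raters < 2 or any(len(row) != n_raters for row in annotations):
        raise ValueError("each item needs the same number (>= 2) of labels")

    counts = [Counter(row) for row in annotations]

    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    pairs = n_raters * (n_raters - 1)
    p_observed = sum(
        (sum(c[label] ** 2 for label in labels) - n_raters) / pairs
        for c in counts
    ) / n_items

    # Expected agreement from the overall share of each label.
    total = n_items * n_raters
    p_expected = sum(
        (sum(c[label] for c in counts) / total) ** 2 for label in labels
    )

    if p_expected == 1.0:
        # Every rater used one single label everywhere; kappa is undefined,
        # so this sketch reports full agreement by convention.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Hypothetical toy annotations: four tweets, three raters each.
    toy = [
        ["Positive", "Positive", "Positive"],
        ["Negative", "Negative", "Neutral"],
        ["Neutral", "Neutral", "Neutral"],
        ["Speech Act", "Skip", "Speech Act"],
    ]
    print(f"Fleiss' kappa on toy data: {fleiss_kappa(toy):.3f}")
```

A value of 1 means the raters always agree, 0 means agreement no better than chance, and negative values mean agreement below chance.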
