Development of a System of Linguistic Markers for Automated Unloading of Thematic Text Data from a Social Network
https://doi.org/10.24412/2070-1381-2023-97-70-84
Abstract
Automated search and selection of texts on a specific topic in a target source, with the aim of building a large, representative thematic text collection (text dataset), is a special case of obtaining and structuring primary data and remains one of the most in-demand applied tasks in natural language processing. The article presents the experience of developing a system of linguistic markers for the automated extraction of texts on COVID-19 vaccination from the VKontakte social network. The final dataset is formed by combining linguistic methods with methods for collecting and processing text data. The test list of markers was compiled from background knowledge, dictionaries, and specialized linguistic services. The task was to create a list of words united by a common conceptual feature, to predict which words co-occur in texts about COVID-19 vaccination, or to find specific words that mark this topic: occasionalisms and designations of specific realia. The content of VKontakte thematic communities, retrieved using the test list of markers, became the source for automated and expert extraction of the main array of markers (354 units). The procedure for automated filtering of an intermediate text sample (12.8 million texts) is described in detail, and the technique for compiling stop-words is given. For the period from 01.01.2020 to 03.01.2023, 4.5 million relevant messages were retrieved; the validity of the markers was confirmed by an insignificant amount of noise on the scale of big data. The general principles of preparing linguistic markers for the automated retrieval of large text data are systematized, the strengths and weaknesses of this tool are noted, and recommendations for compiling a list of linguistic markers are offered.
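The marker-plus-stop-word filtering summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the marker stems, the stop-phrases, and the function names below are hypothetical and given in English for readability, whereas the article works with 354 markers on VKontakte material.

```python
import re

# Hypothetical marker stems and stop-phrases, for illustration only.
# The article's actual list of 354 markers is not reproduced here.
MARKERS = ["vaccin", "revaccin", "sputnik v", "antivax"]
STOP_PHRASES = ["vaccinate your dog", "computer virus"]


def compile_markers(markers):
    """Build one case-insensitive pattern that matches any marker at a word start."""
    pattern = "|".join(r"\b" + re.escape(marker) for marker in markers)
    return re.compile(pattern, re.IGNORECASE)


def is_relevant(text, marker_re, stop_phrases):
    """Keep a text if it contains a marker and none of the stop-phrases."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in stop_phrases):
        return False
    return marker_re.search(text) is not None


def filter_texts(texts, markers=MARKERS, stop_phrases=STOP_PHRASES):
    """Return only the texts judged relevant to the topic."""
    marker_re = compile_markers(markers)
    return [t for t in texts if is_relevant(t, marker_re, stop_phrases)]


if __name__ == "__main__":
    sample = [
        "Got my Sputnik V shot today",
        "Remember to vaccinate your dog this spring",
        "Weather is nice",
        "Revaccination opens next week",
    ]
    for text in filter_texts(sample):
        print(text)
```

Matching on stems at a word boundary is one simple way to cover inflected forms; a real pipeline for Russian-language texts would more likely rely on lemmatization, which this sketch leaves out.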
Keywords
About the Authors
A. Yu. Sarkisova
Russian Federation
Anna Yu. Sarkisova, PhD, Associate Professor, Research Associate, School of Public Administration
Moscow
E. Yu. Petrov
Russian Federation
Evgeny Yu. Petrov, Technician, Supercomputer Center
Tomsk
D. O. Dunaeva
Russian Federation
Daria O. Dunaeva, Research Associate, School of Public Administration
Moscow
For citations:
Sarkisova A.Yu., Petrov E.Yu., Dunaeva D.O. Development of a System of Linguistic Markers for Automated Unloading of Thematic Text Data from a Social Network. Public Administration. E-journal (Russia). 2023;(97):70-84. (In Russ.) https://doi.org/10.24412/2070-1381-2023-97-70-84
