Public Administration. E-journal (Russia)

Development of a System of Linguistic Markers for Automated Unloading of Thematic Text Data from a Social Network

https://doi.org/10.24412/2070-1381-2023-97-70-84

Abstract

Automated search and selection of texts on a specific topic in a target source, with the aim of forming a large, representative thematic text collection (text dataset), is a special case of obtaining and structuring primary data and remains one of the most in-demand applied tasks of natural language processing. The article presents the experience of developing a system of linguistic markers for the automated extraction of texts on vaccination against COVID-19 from the VKontakte social network. The final dataset is formed by combining linguistic methods with methods for collecting and processing text data. The test list of markers is formed on the basis of background knowledge, work with dictionaries and special linguistic services. The task was to create a list of words united by a common conceptual feature, to predict the co-occurrence of words in texts about vaccination against COVID-19, or to find specific words that mark this topic: occasionalisms and designations of specific realia. The content of VKontakte thematic communities retrieved with the test list of markers became the source for automated and expert extraction of the main array of markers (354 units). The procedure for automated filtering of an intermediate text sample (12.8 million texts) is described in detail, and the technique for compiling the stop-word list is given. For the period from 01.01.2020 to 03.01.2023, 4.5 million relevant messages were retrieved; the validity of the markers was confirmed by an insignificant amount of noise on the scale of big data. The general principles of preparing linguistic markers for automated retrieval of large text data are systematized, the strengths and weaknesses of this tool are noted, and recommendations for compiling a list of linguistic markers are offered.
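The marker-plus-stop-word filtering described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the marker stems, the stop-words, and the function names (`compile_stems`, `is_relevant`, `filter_texts`) are hypothetical, the examples are in English rather than Russian, and the article's actual list of 354 markers is not reproduced here. The sketch assumes a text is kept when it contains at least one marker and no stop-word.

```python
import re

# Hypothetical marker stems; the article's real list (354 units) is not shown here.
MARKERS = ["vaccin", "revaccin", "sputnik v", "antivax"]
# Hypothetical stop-words: stems that signal an off-topic text
# even when one of the markers matches.
STOP_WORDS = ["rabies", "veterinar"]


def compile_stems(stems):
    """Build one case-insensitive pattern that matches any stem at a word start."""
    alternatives = "|".join(re.escape(stem) for stem in stems)
    return re.compile(r"\b(?:" + alternatives + r")", re.IGNORECASE)


MARKER_RE = compile_stems(MARKERS)
STOP_RE = compile_stems(STOP_WORDS)


def is_relevant(text):
    """Keep a text if it contains at least one marker and no stop-word."""
    return bool(MARKER_RE.search(text)) and not STOP_RE.search(text)


def filter_texts(texts):
    """Return only the texts that pass the marker and stop-word checks."""
    return [text for text in texts if is_relevant(text)]
```

In this sketch, a post such as "Dog vaccination against rabies is due" matches the marker stem but is rejected by the stop-word, which is the kind of noise reduction the stop-word list is meant to provide. Matching on stems at word starts is a simplification; a morphologically rich language such as Russian would more likely call for lemmatization.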

About the Authors

A. Yu. Sarkisova
Lomonosov Moscow State University
Russian Federation

Anna Yu. Sarkisova, PhD, Associate Professor, Research Associate, School of Public Administration

Moscow



E. Yu. Petrov
National Research Tomsk State University
Russian Federation

Evgeny Yu. Petrov, Technician, Supercomputer Center

Tomsk



D. O. Dunaeva
Lomonosov Moscow State University
Russian Federation

Daria O. Dunaeva, Research Associate, School of Public Administration

Moscow



For citations:


Sarkisova A.Yu., Petrov E.Yu., Dunaeva D.O. Development of a System of Linguistic Markers for Automated Unloading of Thematic Text Data from a Social Network. Public Administration. E-journal (Russia). 2023;(97):70-84. (In Russ.) https://doi.org/10.24412/2070-1381-2023-97-70-84

This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2070-1381 (Online)