Abstract
Authorship attribution aims to distinguish texts written by different authors using text features that represent their styles. In this paper we investigate stylometric features for the Polish language based on part-of-speech (POS) tagging (including POS bigrams) and function words. Because Polish is highly inflected, the feature space tends to be very large, particularly for POS n-grams. Focusing on POS bigrams, we propose a simplified representation that keeps the feature space compact. We report experiments in which authorship attribution was conducted for documents of varying lengths, using classifiers from the Weka library. We evaluate classification results for combinations of the following features: POS tags, POS bigrams, function words and simple document statistics. The experiments indicate that the developed features provide good classification performance.
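As a rough illustration of the feature families named in the abstract, the following Python sketch computes relative frequencies of POS tags, POS bigrams and function words, plus one simple document statistic, from a list of already tagged tokens. It is not the paper's implementation: the function-word list, the colon-separated tag format, the `coarse` helper (which keeps only the POS class as one possible way to shrink the bigram space) and all feature names are assumptions made for this example.

```python
from collections import Counter

# Hypothetical, very short function-word list (assumption for this sketch).
FUNCTION_WORDS = {"i", "w", "na", "z", "że", "nie", "się", "do"}


def coarse(tag):
    """Keep only the POS class of a tag such as 'subst:sg:nom:f'.

    Dropping the inflectional attributes is an assumed simplification,
    not necessarily the representation proposed in the paper.
    """
    return tag.split(":", 1)[0]


def stylometric_features(tagged_tokens):
    """Map a list of (word, tag) pairs to a dict of relative frequencies."""
    n = len(tagged_tokens)
    if n == 0:
        return {}

    classes = [coarse(tag) for _, tag in tagged_tokens]
    pos_counts = Counter(classes)
    # Adjacent pairs of coarse POS classes.
    bigram_counts = Counter(zip(classes, classes[1:]))
    fw_counts = Counter(
        word.lower() for word, _ in tagged_tokens
        if word.lower() in FUNCTION_WORDS
    )

    features = {}
    for pos, count in pos_counts.items():
        features["pos=" + pos] = count / n
    n_bigrams = max(n - 1, 1)  # avoid division by zero for one-token texts
    for (first, second), count in bigram_counts.items():
        features["bigram=" + first + "_" + second] = count / n_bigrams
    for word, count in fw_counts.items():
        features["fw=" + word] = count / n
    # One simple document statistic: average word length in characters.
    features["avg_word_len"] = sum(len(w) for w, _ in tagged_tokens) / n
    return features


# Toy input; in practice the tags would come from a Polish POS tagger.
sample = [
    ("Ala", "subst:sg:nom:f"),
    ("ma", "fin:sg:ter:imperf"),
    ("kota", "subst:sg:acc:m2"),
    ("i", "conj"),
    ("psa", "subst:sg:acc:m2"),
]
feats = stylometric_features(sample)
```

With full inflectional tags the number of distinct bigrams grows roughly with the square of the tag inventory, which is why some reduction of the tag set before pairing is a natural way to keep the feature vector small. The resulting dictionaries could then be exported to a tabular format for a classifier toolkit.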
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Szwed, P. (2017). Stylometric Features for Authorship Attribution of Polish Texts. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2017. Lecture Notes in Computer Science, vol 10246. Springer, Cham. https://doi.org/10.1007/978-3-319-59060-8_17
DOI: https://doi.org/10.1007/978-3-319-59060-8_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59059-2
Online ISBN: 978-3-319-59060-8
eBook Packages: Computer Science (R0)