Applying Authorship Analysis to Arabic Web Content

  • Conference paper
Intelligence and Security Informatics (ISI 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3495)

Abstract

The advent and rapid proliferation of internet communication has given rise to numerous security issues. The anonymity of online media such as email, web sites, and forums makes them an attractive channel for criminal activity. Increased globalization and the boundless nature of the internet have amplified these concerns by adding a multilingual dimension. The world’s social and political climate has drawn a great deal of attention to Arabic. In this study we apply authorship identification techniques to Arabic web forum messages. Our research uses lexical, syntactic, structural, and content-specific writing style features for authorship identification. We address some of the problematic characteristics of Arabic en route to developing an Arabic language model that provides a respectable level of classification accuracy for authorship discrimination. We also run experiments to evaluate the effectiveness of different feature types and classification techniques on our dataset.
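
To make the idea of writing style features concrete, the following is a minimal sketch of how a few lexical and structural measurements might be taken from a single forum message. It is an illustration only and not the feature set used in the paper: the function name `style_features` and every feature name in it are hypothetical, and syntactic and content-specific features are left out. Note that `\w` splits on Arabic diacritics, so fully vocalized text would need extra normalization.

```python
import re


def style_features(message: str) -> dict:
    """Compute a handful of illustrative style features for one message.

    Hypothetical feature names; not the feature set from the paper.
    """
    # \w matches Unicode letters and digits, so undiacritized Arabic
    # words are tokenized the same way as Latin-script words.
    words = re.findall(r"\w+", message)
    lines = message.splitlines()
    n_words = len(words)

    return {
        # Lexical features: character and word level statistics.
        "char_count": len(message),
        "word_count": n_words,
        "avg_word_length": (
            sum(len(w) for w in words) / n_words if n_words else 0.0
        ),
        "vocabulary_richness": (
            len(set(words)) / n_words if n_words else 0.0
        ),
        # Structural features: how the message is laid out.
        "line_count": len(lines),
        "blank_line_count": sum(1 for ln in lines if not ln.strip()),
    }


if __name__ == "__main__":
    print(style_features("hello world\n\nhello"))
```

Vectors of this kind, computed per message and labeled by author, are what a classifier would then be trained on.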

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Abbasi, A., Chen, H. (2005). Applying Authorship Analysis to Arabic Web Content. In: Kantor, P., et al. Intelligence and Security Informatics. ISI 2005. Lecture Notes in Computer Science, vol 3495. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11427995_15

  • DOI: https://doi.org/10.1007/11427995_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25999-2

  • Online ISBN: 978-3-540-32063-0

  • eBook Packages: Computer Science, Computer Science (R0)
