ABSTRACT
Most technical and scientific terms are complex, multi-word noun phrases, but certainly not all noun phrases are technical or scientific terms. Specific terminology can be distinguished from common, non-specific noun phrases on the basis of the observation that terms exhibit far less distributional variation than non-specific noun phrases. We formalize this limited paradigmatic modifiability of terms and then test the corresponding algorithm on bigram, trigram and quadgram noun phrases extracted from a 104-million-word biomedical text corpus. Using an existing, community-curated biomedical terminology as the evaluation gold standard, we show that our algorithm significantly outperforms standard term identification measures and therefore qualifies as a high-performance building block for any terminology identification system. We also provide empirical evidence that, beyond a 10-million-word threshold, the superiority of our approach is essentially independent of domain and corpus size.
Index Terms
- Finding new terminology in very large corpora