DOI: 10.1145/1088622.1088648

Finding new terminology in very large corpora

Published: 02 October 2005

ABSTRACT

Most technical and scientific terms are complex, multi-word noun phrases, but certainly not all noun phrases are technical or scientific terms. Specific terminology can be distinguished from common, non-specific noun phrases based on the observation that terms show a much lower degree of distributional variation than non-specific noun phrases. We formalize this limited paradigmatic modifiability of terms and then test the corresponding algorithm on bigram, trigram, and quadgram noun phrases extracted from a 104-million-word biomedical text corpus. Using an existing, community-wide curated biomedical terminology as an evaluation gold standard, we show that our algorithm significantly outperforms standard term identification measures and therefore qualifies as a high-performance building block for any terminology identification system. We also provide empirical evidence that the superiority of our approach, beyond a 10-million-word threshold, is essentially independent of domain and corpus size.
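
The abstract does not spell out the formalization, so the following is only a minimal sketch of one way to operationalize limited modifiability, not necessarily the measure defined in the paper. It assumes that noun phrases are already extracted as token tuples. For each n-gram it blanks out every possible subset of word slots, takes the ratio of the n-gram's own frequency to the frequency of the resulting wildcard pattern, and multiplies the ratios. The function name `modifiability_scores` and the toy corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import prod


def wildcard_patterns(gram):
    """Yield every pattern obtained by replacing k >= 1 slots with None."""
    n = len(gram)
    for k in range(1, n + 1):
        for slots in combinations(range(n), k):
            yield tuple(None if i in slots else tok for i, tok in enumerate(gram))


def modifiability_scores(ngrams):
    """Score each distinct n-gram in (0, 1]; higher means its slots vary less.

    Assumption: this product-of-ratios score is an illustrative stand-in
    for a modifiability measure, not a formula taken from the abstract.
    """
    ngrams = [tuple(g) for g in ngrams]
    freq = Counter(ngrams)

    # Count each wildcard pattern once per corpus occurrence of an n-gram.
    pattern_freq = Counter()
    for gram in ngrams:
        pattern_freq.update(wildcard_patterns(gram))

    return {
        gram: prod(f / pattern_freq[p] for p in wildcard_patterns(gram))
        for gram, f in freq.items()
    }


# Toy corpus of bigram noun phrases (invented for illustration only).
corpus = (
    [("stem", "cell")] * 4
    + [("large", "cell"), ("small", "cell"), ("large", "house"), ("small", "house")]
)
scores = modifiability_scores(corpus)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy corpus, "stem cell" ranks above "large cell" because "stem" is never followed by anything other than "cell", whereas "large" and "small" combine freely with several head nouns.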


        • Published in

          K-CAP '05: Proceedings of the 3rd international conference on Knowledge capture
          October 2005
          234 pages
          ISBN: 1595931635
          DOI: 10.1145/1088622

          Copyright © 2005 ACM

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Qualifiers

          • Article

          Acceptance Rates

          Overall Acceptance Rate: 55 of 198 submissions, 28%
