Article

Finding new terminology in very large corpora

Authors:
Joachim Wermter
Jena University Language and Information Engineering (JULIE) Lab, Jena, Germany

Udo Hahn
Jena University Language and Information Engineering (JULIE) Lab, Jena, Germany

K-CAP '05: Proceedings of the 3rd international conference on Knowledge capture, October 2005, Pages 137–144. https://doi.org/10.1145/1088622.1088648

Published: 02 October 2005

ABSTRACT

Most technical and scientific terms consist of complex, multi-word noun phrases, but certainly not all noun phrases are technical or scientific terms. Specific terminology can be distinguished from common, non-specific noun phrases on the basis of the observation that terms show much less distributional variation than non-specific noun phrases. We formalize the limited paradigmatic modifiability of terms and then test the corresponding algorithm on bigram, trigram and quadgram noun phrases extracted from a 104-million-word biomedical text corpus. Using an existing, community-wide curated biomedical terminology as the evaluation gold standard, we show that our algorithm significantly outperforms standard term identification measures and therefore qualifies as a high-performance building block for any terminology identification system. We also provide empirical evidence that, beyond a 10-million-word threshold, the superiority of our approach is essentially independent of domain and corpus size.
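
The idea of limited paradigmatic modifiability can be illustrated with a small sketch. The abstract does not state the formula, so the scoring below is an assumption made for illustration, not the authors' published measure. For each way of opening k of the n token slots of a candidate n-gram, it takes the share of same-length n-grams matching the remaining fixed tokens that are the candidate itself, and multiplies these shares together. The names `modifiability_score` and `TOY_COUNTS`, and the toy frequencies, are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy frequencies of bigram noun phrases (not real corpus data).
TOY_COUNTS = Counter({
    ("transcription", "factor"): 8,
    ("transcription", "rate"): 2,
    ("important", "factor"): 2,
    ("important", "role"): 4,
})


def modifiability_score(ngram, counts):
    """Score how rarely the slots of `ngram` are filled by other tokens.

    Assumed scoring (illustrative only): for every non-empty selection of
    slots, divide the frequency of `ngram` by the summed frequency of all
    same-length n-grams that agree with it on the unselected slots, then
    multiply the ratios. Values near 1 mean little variation.
    """
    n = len(ngram)
    target = counts[ngram]  # Counter returns 0 for unseen n-grams
    if target == 0:
        return 0.0
    same_length = [(g, c) for g, c in counts.items() if len(g) == n]
    score = 1.0
    for k in range(1, n + 1):
        for open_slots in combinations(range(n), k):
            fixed = [i for i in range(n) if i not in open_slots]
            # Total frequency of n-grams matching the pattern with open slots.
            total = sum(
                c for g, c in same_length
                if all(g[i] == ngram[i] for i in fixed)
            )
            score *= target / total
    return score


# Rank the toy candidates from least to most modifiable.
ranked = sorted(
    TOY_COUNTS,
    key=lambda g: modifiability_score(g, TOY_COUNTS),
    reverse=True,
)
```

In this toy data, a candidate whose slots rarely take other fillers scores higher than one whose neighbors vary freely. On a real corpus the counts would come from the extracted noun-phrase n-grams, and the brute-force scan over all n-grams would need an index to scale.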

Published in
K-CAP '05: Proceedings of the 3rd international conference on Knowledge capture
October 2005
234 pages
ISBN: 1595931635
DOI: 10.1145/1088622
General Chairs: Peter Clark (The Boeing Company, USA), Guus Schreiber (Free University of Amsterdam, The Netherlands)
Copyright © 2005 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
knowledge extraction
natural language processing
terminology
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 55 of 198 submissions, 28%