(367d) Entity Extraction for Ontology Based Intelligent Querying in the Pharmaceutical Domain
2009 Annual Meeting, Computing and Systems Technology Division, Knowledge Management and Organizational Learning
In this work, we address the problem of extracting entities (or concepts) and relations between entities to automatically build an ontology over a corpus of pharmaceutical documents. We use a classification model based on conditional random fields [2] to tag document text using predefined entity types such as TABLET, API, MANUFACTURING_PROCESS and OPERATING_CONDITION. We build an interface to the Purdue Ontology for Pharmaceutical Engineering (POPE) [3] such that the ontology engine is populated with entities and relations automatically. The ontology is then used to search for associations between entities and answer questions that help in making design decisions.
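The flow from tagged text to a queryable store can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes the conditional random field tagger emits BIO-style labels (B-/I-/O prefixes on the entity types named above), it treats entities that occur in the same sentence as associated, and it stands in for the ontology engine with a hypothetical `ToyOntology` class rather than the actual POPE interface. The example sentence is invented.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Token = Tuple[str, str]    # (word, BIO tag), e.g. ("wet", "B-MANUFACTURING_PROCESS")
Entity = Tuple[str, str]   # (entity type, surface text)


def extract_entities(tagged: List[Token]) -> List[Entity]:
    """Merge BIO-tagged tokens (assumed tagger output) into entity spans."""
    entities: List[Entity] = []
    current_type = None
    current_words: List[str] = []
    for word, tag in tagged:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # Close any open span, then open a new one.
            if current_type is not None:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-"):
            # Continuation of the open span.
            current_words.append(word)
        else:
            # "O" tag: close any open span.
            if current_type is not None:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_type is not None:
        entities.append((current_type, " ".join(current_words)))
    return entities


class ToyOntology:
    """Hypothetical stand-in for an ontology engine (not the POPE API)."""

    def __init__(self) -> None:
        self.instances: Dict[str, Set[str]] = defaultdict(set)
        self.related: Dict[str, Set[Entity]] = defaultdict(set)

    def add_sentence(self, entities: List[Entity]) -> None:
        # Assumption: entities in the same sentence are associated.
        for etype, name in entities:
            self.instances[etype].add(name)
        for a in entities:
            for b in entities:
                if a != b:
                    self.related[a[1]].add(b)

    def associated(self, name: str, etype: str) -> List[str]:
        """Return entities of type `etype` associated with `name`."""
        return sorted(n for t, n in self.related.get(name, ()) if t == etype)


# Invented example sentence with hand-written tags.
tagged_sentence: List[Token] = [
    ("The", "O"), ("ibuprofen", "B-API"), ("tablet", "B-TABLET"),
    ("was", "O"), ("made", "O"), ("by", "O"),
    ("wet", "B-MANUFACTURING_PROCESS"), ("granulation", "I-MANUFACTURING_PROCESS"),
    ("at", "O"), ("60", "B-OPERATING_CONDITION"), ("C", "I-OPERATING_CONDITION"),
]

ontology = ToyOntology()
ontology.add_sentence(extract_entities(tagged_sentence))
print(ontology.associated("ibuprofen", "MANUFACTURING_PROCESS"))
```

A query such as `ontology.associated("ibuprofen", "OPERATING_CONDITION")` then plays the role of the association search described above. A real system would replace the co-occurrence rule with extracted relations and the dictionary store with the ontology engine.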
Fluck et al. [4] provide a general overview of information extraction in the life sciences industries, with a special emphasis on biomedical entity extraction (for example, protein and gene names). They also describe the specific challenges in chemical entity recognition and highlight some of the recent work in that direction. Banville [5] reports the problems in mining chemical structural information from pharmaceutical literature, which arise mainly from the non-standard representation of chemical structures. While there has been substantial effort on entity extraction and question answering in the biomedical and clinical domains [6-10], little research has focused on this problem as applied to pharmaceutical drug design and discovery. Our work is an effort in this direction.
References:
1. P. Beringer, A. DerMarderosian and L. Felton. Remington: The Science and Practice of Pharmacy, 21st Edition, Lippincott, Williams and Wilkins, University of the Sciences, Philadelphia, 2006.
2. J. Lafferty, A. McCallum and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, Morgan Kaufmann, San Francisco, CA, 2001, pp. 282-289.
3. L. M. Hailemariam, A. Jain, P. Suresh, V. P. K. Akkisetty, G. Joglekar, S-H. Hsu, K. R. Morris, G. V. Reklaitis, P. K. Basu and V. Venkatasubramanian. The Pope Ontology for Pharmaceutical Product Development. AIChE Annual Meeting, Salt Lake City, 2007.
4. J. Fluck, M. Zimmermann, G. Kurapkat and M. Hofmann. Information extraction technologies for the life science industry. Drug Discovery Today: Technologies, Vol. 2, No. 3, 2005, Elsevier, doi:10.1016/j.ddtec.2005.08.013.
5. D. L. Banville. Mining chemical structural information from the drug literature. Drug Discovery Today, Vol. 11, No. 1/2, January 2006, Elsevier.
6. R. McDonald and F. Pereira. Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics, 6(Suppl 1):S6, 2005, doi:10.1186/1471-2105-6-S1-S6.
7. L. Tanabe, N. Xie, L. H. Thom, W. Matten and W. J. Wilbur. GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinformatics, 6(Suppl 1):S3, 2005, doi:10.1186/1471-2105-6-S1-S3.
8. B. Settles. ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics, 21(14):3191-3192, 2005, doi:10.1093/bioinformatics/bti475.
9. D. Demner-Fushman and J. Lin. Knowledge Extraction for Clinical Question Answering: Preliminary Results. In Proceedings of the AAAI-05 Workshop on Question Answering in Restricted Domains, 2005.
10. P. Zweigenbaum. Question Answering in Biomedicine. In Proceedings of the Workshop on Natural Language Processing for Question Answering, EACL 2003.