In biochemistry and related fields, far more effort is spent on data generation than on data analysis. This is reflected in the large number of publications produced each year, e.g. around 450,000 per year in Medline (a database of publications in medicine, biology, chemistry, biochemistry, etc.). This mass of information remains "hidden" from field experts, who do not have the capacity to keep up with the number of publications. Natural language processing (especially information extraction or retrieval) can help scientists extract, and thus analyse, the information relevant to their research projects.
The GenIE project comprises the following sub-projects:
- lexicon development (GenIELex)
- syntactic processing
- shallow semantics
- template filling
GenIELex
The main task of the GenIELex project is the development of a biochemistry-specific lexicon and of an annotated corpus for evaluating the system. The need for such a lexicon is illustrated by the following figures, based on a corpus of full papers extracted from the Journal of Biological Chemistry.
| Measure | Count |
|---|---|
| Tokens | 30 × 10⁶ |
| Unknown tokens | 3.5 × 10⁶ (~12%) |
| Types | 405,000 |
| Hapax legomena (1) | 205,000 |
| Unknown types (2) | 322,000 |
| (1) ∩ (2) | 172,500 |
Of the types that occur just once (hapax legomena), 172,500 (roughly 84%) are unknown terms, which very likely belong to the biochemical terminology. Given that the average sentence in this corpus is 27 words long and that on average every 9th token is unknown, it is very probable that a sentence contains at least one unknown token; the average number of unknown tokens per sentence is 3. These figures show the basic need for at least semi-automatic methods of lexicon generation.
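The figures above can be reproduced in principle with a few lines of counting. The following is a minimal sketch, not the project's actual tooling: the function name `corpus_stats`, the toy sentence and the toy lexicon are assumptions made for illustration. The second part redoes the per-sentence arithmetic from the reported corpus figures; the probability estimate additionally assumes that unknown tokens occur independently of each other, which is a simplification.

```python
from collections import Counter

def corpus_stats(tokens, lexicon):
    """Count tokens, types, hapax legomena and items missing from the lexicon."""
    freq = Counter(tokens)
    hapax = {t for t, n in freq.items() if n == 1}
    unknown_types = {t for t in freq if t not in lexicon}
    return {
        "tokens": len(tokens),
        "unknown_tokens": sum(n for t, n in freq.items() if t not in lexicon),
        "types": len(freq),
        "hapax": len(hapax),
        "unknown_types": len(unknown_types),
        "unknown_hapax": len(hapax & unknown_types),  # (1) ∩ (2)
    }

# Toy data (hypothetical), only to show the counting.
tokens = "the kinase binds the receptor and the phosphatase binds p53".split()
lexicon = {"the", "binds", "and", "receptor"}
stats = corpus_stats(tokens, lexicon)

# Arithmetic on the reported corpus figures.
unknown_rate = 3.5e6 / 30e6               # share of unknown tokens
sentence_length = 27                      # average words per sentence
expected_unknown = sentence_length * unknown_rate
# Assumption: unknown tokens are independent of each other.
p_at_least_one = 1 - (1 - unknown_rate) ** sentence_length

print(stats)
print(round(expected_unknown, 2), round(p_at_least_one, 2))
```

Under the independence assumption, the chance that a 27-word sentence contains at least one unknown token comes out at about 96%, and the expected number of unknown tokens per sentence is a little over 3, in line with the figures quoted above.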
Syntactic Processing (chunking)
Text chunking divides a text into syntactically correlated, non-overlapping phrases (sentence parts) using finite-state techniques. The main advantage of a chunker over a context-free parser is that robustness and speed are primary design considerations. Precision is also high.
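As an illustration of finite-state chunking, the sketch below runs a regular expression over a sequence of part-of-speech tags. It is not the GenIE chunker: the tag set, the one-letter tag codes and the three chunk patterns are assumptions chosen to keep the example small. Each tag is encoded as a single character so that match offsets in the tag string coincide with token positions.

```python
import re

# Hypothetical coarse tag set, one character per tag.
TAG_CODE = {"DET": "D", "ADJ": "A", "NOUN": "N", "VERB": "V", "PREP": "P", "ADV": "R"}

# Assumed chunk grammar: noun phrase, verb group, bare preposition.
CHUNK_RE = re.compile(r"(?P<NP>D?A*N+)|(?P<VP>R*V+)|(?P<PP>P)")

def chunk(tagged):
    """Split a list of (word, tag) pairs into non-overlapping chunks."""
    # Tags outside the tag set become "x" and never match a pattern.
    codes = "".join(TAG_CODE.get(tag, "x") for _, tag in tagged)
    chunks = []
    # finditer scans left to right and skips tokens no pattern covers,
    # so unexpected input is passed over instead of failing the analysis.
    for m in CHUNK_RE.finditer(codes):
        words = [w for w, _ in tagged[m.start():m.end()]]
        chunks.append((m.lastgroup, words))
    return chunks

sentence = [("the", "DET"), ("phosphorylated", "ADJ"), ("protein", "NOUN"),
            ("binds", "VERB"), ("to", "PREP"), ("the", "DET"), ("receptor", "NOUN")]
result = chunk(sentence)
print(result)
```

Because the scan is a single left-to-right pass and tokens that fit no pattern are simply skipped, the sketch shows in miniature why a chunker of this kind is fast and robust compared with a full parse.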
Shallow Semantics (word sense tagging)
In the semantic tagging step, we annotate the chunks or partial parse trees with semantic tags. We distinguish three types of semantic annotation. The first type annotates concepts of the underlying ontology. The second type associates semantic tags with closed-class words of English (such as of, not, and). The third type is context dependent; the context may be given by a particular syntactic environment.
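The three annotation types can be sketched as three lookups applied to each word of a chunk. Everything concrete here is hypothetical: the ontology entries, the tag names and the context rule (treating "by" as an agent marker only when it directly follows a verb chunk) are assumptions for illustration, since the actual ontology and tag inventory are not given in this text.

```python
# Hypothetical resources, invented for this example.
ONTOLOGY = {"kinase": "Enzyme", "receptor": "Protein"}
CLOSED_CLASS = {"of": "OF", "not": "NEGATION", "and": "CONJUNCTION"}

def tag_chunks(chunks):
    """chunks: list of (chunk_type, [words]).
    Returns a list of (chunk_type, [(word, semantic_tag_or_None)])."""
    out = []
    for i, (kind, words) in enumerate(chunks):
        prev_kind = chunks[i - 1][0] if i > 0 else None
        tagged = []
        for w in words:
            key = w.lower()
            if key in ONTOLOGY:                      # type 1: ontology concept
                tag = ONTOLOGY[key]
            elif key in CLOSED_CLASS:                # type 2: closed-class word
                tag = CLOSED_CLASS[key]
            elif key == "by" and prev_kind == "VP":  # type 3: context dependent
                tag = "AGENT_MARKER"
            else:
                tag = None                           # no semantic tag assigned
            tagged.append((w, tag))
        out.append((kind, tagged))
    return out

chunks = [("NP", ["the", "receptor"]), ("VP", ["is", "phosphorylated"]),
          ("PP", ["by"]), ("NP", ["the", "kinase"])]
tagged = tag_chunks(chunks)
print(tagged)
```

The same word "by" receives no tag when it does not follow a verb chunk, which is the sense in which the third annotation type depends on the syntactic environment.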
Template Filling
Template filling is the step of transferring the relevant information from the text into templates. A template can be regarded as a restricted semantic representation of the desired information, answering questions like who, what, whom, where and sometimes even why. This step uses the results of the linguistic processing, mainly the subcategorization frames and the corresponding selectional restrictions. This information is used to map from the chunk level into the open slots of the templates. A crucial point about information extraction is that the linear order within a sentence has to be broken up and rearranged.
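A minimal sketch of this mapping, under stated assumptions: chunks are represented as plain dictionaries carrying a head word, an ontology concept and, for verb chunks, a passive flag; the single frame for "inhibit", its slot names and its allowed concepts are hypothetical. The point of the example is that an active and a passive sentence yield the same template, because slots are assigned by the frame and not by surface order.

```python
# Hypothetical subcategorization frame with selectional restrictions:
# verb lemma -> grammatical role -> (template slot, allowed concepts).
FRAMES = {
    "inhibit": {
        "subject": ("agent", {"Compound", "Protein", "Enzyme"}),
        "object": ("target", {"Protein", "Enzyme"}),
    },
}

def fill_template(chunks):
    """Fill one template from the first verb chunk that has a frame."""
    for i, c in enumerate(chunks):
        if c["kind"] != "VP" or c["head"] not in FRAMES:
            continue
        # Nearest noun phrase on each side of the verb chunk.
        left = next((x for x in reversed(chunks[:i]) if x["kind"] == "NP"), None)
        right = next((x for x in chunks[i + 1:] if x["kind"] == "NP"), None)
        # In a passive clause the surface subject is the logical object.
        subj, obj = (right, left) if c.get("passive") else (left, right)
        template = {"event": c["head"]}
        for role, np in (("subject", subj), ("object", obj)):
            slot, allowed = FRAMES[c["head"]][role]
            # Selectional restriction: fill the slot only with an allowed concept.
            ok = np is not None and np.get("concept") in allowed
            template[slot] = np["head"] if ok else None
        return template
    return None

active = [{"kind": "NP", "head": "staurosporine", "concept": "Compound"},
          {"kind": "VP", "head": "inhibit"},
          {"kind": "NP", "head": "PKC", "concept": "Enzyme"}]
passive = [{"kind": "NP", "head": "PKC", "concept": "Enzyme"},
           {"kind": "VP", "head": "inhibit", "passive": True},
           {"kind": "PP", "head": "by"},
           {"kind": "NP", "head": "staurosporine", "concept": "Compound"}]
print(fill_template(active))
print(fill_template(passive))
```

A slot whose candidate violates the selectional restriction is left empty in this sketch; a real system would more likely try other candidate chunks before giving up.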