WikiBiography
WikiBiography is a corpus of about 1200 annotated biographies from the
German version of Wikipedia. Fully automatic preprocessing includes
the following:
- sentence boundaries
- part-of-speech tags
- word lemmas
- syntactic dependencies
- anaphora resolution*
- discourse connectives
- classified named entities
- temporal expressions
* Each biography has only one coreference chain, which links all
mentions of the biographee.
The annotation is done with freely available software (see
References). To visualize the data, or to access and correct the
annotation, use MMAX2. With the MMAX2 API you can access any
annotation layer from your Java programs.
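For readers who want to inspect the files without the Java API, the following is a minimal Python sketch of reading standoff annotation. It assumes the usual MMAX2 layout, in which a words file holds the tokens and each annotation level has a markable file whose `span` attribute points at word ids. The sample data, the `entity` level name, and the `type` attribute are hypothetical and may differ from what the corpus actually uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a words file and one markable level.
WORDS_XML = """<words>
<word id="word_1">Niels</word>
<word id="word_2">Bohr</word>
<word id="word_3">wurde</word>
<word id="word_4">in</word>
<word id="word_5">Kopenhagen</word>
<word id="word_6">geboren</word>
</words>"""

MARKABLES_XML = """<markables xmlns="www.eml.org/NameSpaces/entity">
<markable id="markable_1" span="word_1..word_2" type="person"/>
<markable id="markable_2" span="word_5" type="location"/>
</markables>"""


def expand_span(span):
    """Expand a span such as 'word_1..word_3,word_7' into a list of word ids."""
    ids = []
    for part in span.split(","):
        if ".." in part:
            start, end = part.split("..")
            prefix, first = start.rsplit("_", 1)
            last = end.rsplit("_", 1)[1]
            ids.extend(f"{prefix}_{i}" for i in range(int(first), int(last) + 1))
        else:
            ids.append(part)
    return ids


def read_markables(words_xml, markables_xml):
    """Return (covered text, attributes) for every markable in one level."""
    words = {w.get("id"): w.text for w in ET.fromstring(words_xml).iter("word")}
    result = []
    for elem in ET.fromstring(markables_xml).iter():
        # Drop the namespace part of the tag so any level namespace is accepted.
        if elem.tag.rsplit("}", 1)[-1] != "markable":
            continue
        text = " ".join(words[i] for i in expand_span(elem.get("span")))
        attrs = {k: v for k, v in elem.attrib.items() if k not in ("id", "span")}
        result.append((text, attrs))
    return result
```

Calling `read_markables(WORDS_XML, MARKABLES_XML)` pairs each markable with the words it covers, which is enough to list, for example, all entities of one level in a biography.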
Screenshots
Orange and green fonts are used for temporal expressions
(e.g. "7. Oktober 1885", "später") and locations
(e.g. "Kopenhagen", "Dänemarks") respectively. People other than
the biographee (e.g. "Christian Bohr", "Harald Bohr") are highlighted
in light blue. Mentions of the biographee are highlighted in red
(e.g. "Niels Henrik David Bohr", "er", "Niels Bohr"). The annotation
of a selected word (e.g. "Professor") is displayed in a separate
window. The head of the selected word is then highlighted in grey, and
an arc from the dependent word to its head is displayed.
Code Sample
Download
Click here
to download WikiBiography.
References
A CPAN Perl module is used for sentence boundary identification.
The TnT tagger is used for PoS tagging:
Brants, T.: 2000, 'TnT - A statistical Part-of-Speech tagger'. In:
Proceedings of the 6th Conference on Applied Natural Language
Processing, Seattle, Wash., 29 April - 4 May 2000. pp. 224-231.
TreeTagger is used for lemmatization:
Schmid, H.: 1997, 'Probabilistic part-of-speech tagging using decision
trees'. In: D. Jones and H. Somers (eds.): New Methods in Language
Processing. London, UK: UCL Press, pp. 154-164.
The WCDG parser is used for dependency parsing:
Foth, K. and W. Menzel: 2006, 'Hybrid parsing: Using probabilistic
models as predictors for a symbolic parser'. In: Proceedings of the
44th Annual Meeting of the Association for Computational Linguistics,
Sydney, Australia, 17-21 July 2006. pp. 321-327.
A list of about 300 connectives from IDS Mannheim is used to identify
discourse connectives in our corpus.
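List-based identification of this kind can be sketched as a longest-match lookup over the tokens of a sentence. The four connectives below are an illustrative stand-in, not the actual IDS list, and the function name is hypothetical. The sketch does not attempt to decide whether a matched word really functions as a connective in context.

```python
# Illustrative stand-in for the connective list (not the actual IDS list).
CONNECTIVES = {("aber",), ("deshalb",), ("nachdem",), ("so", "dass")}
MAX_LEN = max(len(c) for c in CONNECTIVES)


def find_connectives(tokens):
    """Return (start, end, connective) per match; end is exclusive."""
    lowered = [t.lower() for t in tokens]
    hits = []
    i = 0
    while i < len(lowered):
        # Try the longest candidate first so "so dass" beats a shorter entry.
        for n in range(MAX_LEN, 0, -1):
            if i + n > len(lowered):
                continue
            candidate = tuple(lowered[i:i + n])
            if candidate in CONNECTIVES:
                hits.append((i, i + n, " ".join(candidate)))
                i += n
                break
        else:
            # No connective starts at this token.
            i += 1
    return hits
```

The returned token offsets can be stored as one annotation layer alongside the others.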
Temporal expressions are identified with a set of templates. Named
entities are classified as person, location or organization based on
the information from Wikipedia.
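Template-based recognition of temporal expressions can be approximated with regular expressions. The templates below are hypothetical examples written for this sketch, covering dates such as "7. Oktober 1885" and adverbs such as "später"; they are not the templates used to build the corpus.

```python
import re

MONTHS = ("Januar|Februar|März|April|Mai|Juni|Juli|August|"
          "September|Oktober|November|Dezember")

# Hypothetical templates, ordered from most to least specific.
TEMPLATES = [
    ("date", rf"\b\d{{1,2}}\.\s(?:{MONTHS})\s\d{{4}}\b"),
    ("month_year", rf"\b(?:{MONTHS})\s\d{{4}}\b"),
    ("year", r"\b1\d{3}\b|\b20\d{2}\b"),
    ("adverb", r"\b(?:später|danach|zuvor)\b"),
]

# One named group per template; the group name tells which template matched.
PATTERN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TEMPLATES))


def find_temporal(text):
    """Return (template name, matched text) for each temporal expression."""
    return [(m.lastgroup, m.group()) for m in PATTERN.finditer(text)]
```

Ordering the templates from specific to general keeps a full date from being split into a separate month and year match.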