Dan Tufis & al * Corpora and Corpus-Based Morpho-Lexical
Processing
In the context of the MULTEXT-EAST Copernicus project
(http://www.lpl.univ-aix.fr/MTE and http://nl.ijs.si/ME),
a parallel corpus was developed consisting of six complete translations
of George Orwell's "1984": besides the original English
version, the corpus contains translations into Bulgarian, Czech,
Estonian, Hungarian, Romanian and Slovene. This corpus is consistently
SGML encoded across the seven languages (CES conformant [4] up to
level 3), sentence aligned and morpho-syntactically annotated.
MULTEXT-EAST also produced a second corpus, multilingual and balanced
but not parallel, consisting of original novels and newspaper
articles for each language.
A second parallel corpus was developed
within the TELRI Copernicus Concerted Action (Trans European Language
Resource Infrastructure, http://www.ids-mannheim.de/TELRI)
and consists of complete translations of Plato's "Republic"
(besides the original Greek version, the sub-corpus contains translations
into Bulgarian, Czech, English, German, Lithuanian, Polish, Romanian,
Slovak and Slovene).
It is worth mentioning that both
"1984" and "Republic" are available in many
other languages, but at the time of writing these versions had not been
brought into line (in terms of text encoding) with the conventions used for
the languages mentioned above. The encoding of the Romanian version
of "Republic" is described in [5].
There were several reasons for choosing the
source texts ("1984" and "Republic") on which
the parallel corpora were built, but the most important were the
availability of translations in all the languages of the partners
involved and copyright permissions. The two corpora are complementary
as far as linguistic registers and (human) translation techniques
are concerned: whereas "1984" was translated in a more literary manner,
in the spirit of the book, "Republic" was translated
in a rather technical way, providing support for Sinclair's hypothesis
on "translation protected elements".
In order to develop reusable resources,
it was essential to establish standardized methods and specifications
for these resources. For corpus encoding, MULTEXT-EAST and, to
some extent, TELRI adopted the SGML-based CES (Corpus Encoding
Specifications) schema [4],
which in turn is based on the
TEI recommendations [6, 7].
While the corpus-encoding schema posed
no significant problems in accommodating the language-specific
renderings of the parallel texts, harmonising the lexicon
encoding was a serious task. Based on the specifications developed
in EAGLES [8] and their extension
by the MULTEXT project [9]
to six western European languages (English, French, Dutch, Italian,
German, Spanish), the MULTEXT-EAST project further extended these
specifications to cover new languages (Bulgarian, Czech, Estonian,
Hungarian, Romanian and Slovene) with different characteristics.
Because these languages exhibit many features and properties
not found in western European languages, such as heavy inflection
and agglutination, adapting the specifications initially developed
for western European languages posed many interesting
and difficult problems and demanded substantial assessment and
modification of the pre-existing specifications. The work carried
out in MULTEXT-EAST has thus broadened the base and contributed
significantly to defining a general mechanism for lexical specification
(<URL: http://www.lpl.univ-aix.fr/pub/multext/docs/ME1.1.tex> and
<URL: http://www.ijs.si/ME/docs/>).
A significant benefit
of conforming to the corpus and lexicon specifications originating
in EAGLES and MULTEXT was the easy adaptation and extension of the
corpus-annotation tools initially developed within the MULTEXT
project.
Since the MULTEXT-EAST and TELRI corpora are presented from a
multilingual point of view elsewhere
[10,11,12], in the following we will
concentrate on the Romanian component.
2. Word-form Dictionary and morpho-syntactic descriptions
For corpus morpho-lexical processing purposes (fast
text lemmatisation, language model construction, automatic tagging,
etc.) the MULTEXT-EAST consortium developed several language-specific
word-form dictionaries covering at least the words appearing in
the corpus. A dictionary entry has the following structure:
word-form <TAB> lemma <TAB> MSD <TAB> comments
where word-form represents an
inflected form of the lemma, characterised by a
combination of feature values encoded by the MSD code
(Morpho-Syntactic Description); the fourth column,
comments, is optional and currently ignored;
it may contain either comments or information processable by
other tools. The morpho-syntactic descriptions are provided as
strings, using a linear encoding. In this notation, each position
in the string of characters corresponds to an attribute, and the specific
character in that position indicates the value of the corresponding
attribute.
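
The entry structure and the positional MSD notation described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the tooling used by the consortium: the names parse_entry, decode_msd, NOUN_ATTRIBUTES and VALUES are hypothetical, the small attribute table is a simplified stand-in for the actual lexical specifications, and the sample entry is invented for the example rather than taken from any MULTEXT-EAST dictionary.

```python
from typing import Dict, List, NamedTuple, Optional


class Entry(NamedTuple):
    """One dictionary line: word-form, lemma, MSD and optional comments."""
    word_form: str
    lemma: str
    msd: str
    comments: Optional[str] = None


def parse_entry(line: str) -> Entry:
    """Split a TAB-separated dictionary line into its columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 TAB-separated fields: {line!r}")
    word_form, lemma, msd = fields[:3]
    # The fourth column is optional; anything after the MSD is kept verbatim.
    comments = "\t".join(fields[3:]) or None
    return Entry(word_form, lemma, msd, comments)


# ASSUMPTION: a simplified, illustrative attribute table for nouns.
# Position i of the MSD string holds the value of NOUN_ATTRIBUTES[i].
NOUN_ATTRIBUTES: List[str] = ["category", "type", "gender", "number"]
VALUES: Dict[str, Dict[str, str]] = {
    "category": {"N": "noun"},
    "type": {"c": "common", "p": "proper"},
    "gender": {"m": "masculine", "f": "feminine", "n": "neuter"},
    "number": {"s": "singular", "p": "plural"},
}


def decode_msd(msd: str, attributes: List[str],
               values: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Map each character of the MSD to the attribute at the same position."""
    decoded = {}
    for attribute, code in zip(attributes, msd):
        # Codes missing from the table are returned unchanged.
        decoded[attribute] = values[attribute].get(code, code)
    return decoded


# Invented sample entry (no comments column).
entry = parse_entry("houses\thouse\tNcnp")
features = decode_msd(entry.msd, NOUN_ATTRIBUTES, VALUES)
```

Here the invented entry decodes to a common noun, neuter, plural. Because the meaning of a character depends only on its position, a decoder needs nothing more than an ordered attribute list per category, which is what makes the linear encoding compact and easy to process.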