Corpora and Corpus-Based Morpho-Lexical Processing
1. Corpora and types of corpora
The term corpus as used
here refers to a collection of spoken or written texts encoded
in a specific machine-readable format. Corpora are used in language
engineering to gather both qualitative and quantitative real language
evidence. Qualitative evidence consists of examples which can
be used for the construction of computational lexicons, grammars,
and multi-lingual lexicons and term banks, for lexicography, etc.
Quantitative information consists of statistics indicating frequent
or characteristic uses of language. These statistics can also
be used to guide preference-based parsers, assist in lexicography,
determine translation equivalents, etc. In addition, statistics
can be used to drive morphological taggers, POS taggers, alignment
programs, sense taggers, etc. Common operations on corpora for
the purposes of language engineering include extraction of sub-corpora;
sophisticated search and retrieval, including collocation extraction,
concordance generation, generation of lists of linguistic elements,
etc.; and the determination of statistics such as frequency information,
averages, mutual information scores, etc. We do not address corpora
intended for other applications, such as stylistic studies, socio-linguistics,
historical studies, information retrieval, etc., although these
uses are not excluded a priori (in fact, many of the features
required for these applications may be the same as those needed
for language engineering). The encoding format should be standardised
and homogeneous, both for interchange and for open-ended
retrieval tasks [1].
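The operations listed above (frequency lists, mutual information
scores, concordance generation) can be illustrated with a minimal
Python sketch. The tokenizer, the function names and the sample
sentence are assumptions made for illustration only and are not taken
from any tool described here. The mutual information score is computed
as pointwise mutual information over adjacent word pairs, which is one
common choice among several.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): split on whitespace, lowercase,
    # strip surrounding punctuation, drop empty results.
    stripped = (w.strip(".,;:!?()").lower() for w in text.split())
    return [w for w in stripped if w]


def frequencies(tokens):
    # Frequency list: maps each word form to its number of occurrences.
    return Counter(tokens)


def pmi(tokens, w1, w2):
    # Pointwise mutual information (base 2) of the adjacent pair (w1, w2).
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    if bigrams[(w1, w2)] == 0:
        return float("-inf")  # the pair never occurs side by side
    p_xy = bigrams[(w1, w2)] / (len(tokens) - 1)
    p_x = unigrams[w1] / len(tokens)
    p_y = unigrams[w2] / len(tokens)
    return math.log2(p_xy / (p_x * p_y))


def concordance(tokens, keyword, width=3):
    # Keyword-in-context lines as (left context, keyword, right context).
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical two-sentence sample standing in for a corpus.
sample = "The parallel corpus is aligned. The parallel corpus is small."
tokens = tokenize(sample)

print(frequencies(tokens).most_common(3))
print(pmi(tokens, "parallel", "corpus"))
for left, key, right in concordance(tokens, "corpus", width=2):
    print(f"{left:>15} [{key}] {right}")
```

On a real corpus the tokenizer would be replaced by one that respects
the encoding format, and the probabilities would usually be smoothed,
since low counts make mutual information scores unreliable.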
Restricting the scope to a single domain makes it possible to develop
a tighter standard than that of the TEI, by providing specific
encoding solutions rather than general or multiple ones and,
most importantly, by providing standards for elements particularly
important in that domain (e.g., linguistic annotation). The texts
are selected according to explicit criteria that follow from the
main purpose the corpus is supposed to serve (lexicographic tasks;
terminological data extraction or acquisition; contrastive or
comparative studies on parallel texts; grammar induction; machine
translation; etc.). Based on several linguistic classification
criteria, Sinclair defines a corpus typology in [1] (see also [2]
in this volume). Of the types in that typology, we are concerned
here with special and parallel corpora.
A special corpus is composed
of homogeneous texts which are not necessarily representative
of a language; that is, the corpus is not expected to provide
evidence covering all the registers of the language in question.
Of course, since these texts belong to a given language, they
will display a set of grammatical and lexical features proper
to that language. On the other hand, the texts
contained in special corpora, as opposed to those in reference corpora
[1], also exhibit features (mainly at the level of the lexicon)
which may be idiosyncratic with respect to the language (rare
words, specific collocation patterns, etc.).
Whereas a reference corpus is
required to be as large as possible and based on complete and
homogeneous texts, covering all the linguistic registers and dialectal
variations, a special corpus may be smaller.
It may contain more specialized (and even fragmentary) texts,
for instance texts belonging to a single author or period, as well
as texts obtained under artificial and experimental conditions
(e.g., those resulting from investigations).
The second corpus type relevant
to the work reported here is the parallel corpus. It is
a collection of original texts, each of which is translated into
at least one other language. The simplest case is the one
where only two languages are involved: one language is the
source and the other its translation. However, parallel corpora may
contain many languages. A parallel corpus is inherently a special one:
even if the source texts, taken in isolation, constituted a reference
corpus, their translations into another language would not, in general,
preserve representativeness with respect to that language.
Thanks to the alignment between the original and its translation(s),
parallel corpora offer insight into the nature
of translation. They may also help in devising better
translation tools. For instance, translation systems
that employ probabilistic methods might increasingly be used
with such corpora. A corpus may serve a large number of applications.
In [3] we made a plea for computational corpora
and discussed some of the methodological benefits (and direct
applications) of corpus linguistics as contrasted with the classical
approach of introspective theoretical linguistics.
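The idea of an aligned parallel corpus described above, and its use
for finding candidate translation equivalents, can be sketched in
Python as follows. This is only an illustration under stated
assumptions: the AlignedUnit class, the dice_scores function and the
three English-French sentence pairs are hypothetical, alignment is
assumed to be given at sentence level, and the Dice coefficient is
chosen here as one simple association measure, not as the method of
any system discussed in this paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AlignedUnit:
    source: str  # sentence in the original language
    target: str  # its aligned translation


def dice_scores(units, source_word):
    # Rank target-language words by their Dice coefficient with
    # source_word, counting co-occurrence within aligned units.
    src_count = 0            # units whose source side contains source_word
    tgt_counts = Counter()   # units whose target side contains each word
    joint = Counter()        # units containing both source_word and the word
    for unit in units:
        src_words = set(unit.source.lower().split())
        tgt_words = set(unit.target.lower().split())
        tgt_counts.update(tgt_words)
        if source_word in src_words:
            src_count += 1
            joint.update(tgt_words)
    scores = {t: 2 * joint[t] / (src_count + tgt_counts[t]) for t in joint}
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy corpus; a real one would hold thousands of units.
corpus = [
    AlignedUnit("the house is small", "la maison est petite"),
    AlignedUnit("the house is old", "la maison est vieille"),
    AlignedUnit("the garden is small", "le jardin est petit"),
]

for word, score in dice_scores(corpus, "house"):
    print(f"{word:10} {score:.2f}")
```

With so little data, function words that always accompany the source
word score as highly as the true equivalent, which is one reason why
such statistics are only reliable on large aligned corpora.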