Corpora and Corpus-Based Morpho-Lexical Processing

Dan Tufis, Ana-Maria Barbu, Vasile Pãtrascu,
Georgiana Rotariu, Camelia Popescu


1. Corpora and types of corpora

The term corpus, as used here, refers to a collection of spoken or written texts encoded in a specific machine-readable format. Corpora are used in language engineering to gather both qualitative and quantitative evidence of real language use. Qualitative evidence consists of examples that can be used in lexicography and in the construction of computational lexicons, grammars, multilingual lexicons, and term banks. Quantitative evidence consists of statistics indicating frequent or characteristic uses of language. These statistics can guide preference-based parsers, assist in lexicography, and help determine translation equivalents; they can also drive morphological taggers, POS taggers, alignment programs, sense taggers, etc.

Common operations on corpora for the purposes of language engineering include the extraction of sub-corpora; sophisticated search and retrieval, including collocation extraction, concordance generation, and the generation of lists of linguistic elements; and the computation of statistics such as frequency information, averages, and mutual information scores.

We do not address corpora intended for other applications, such as stylistic studies, sociolinguistics, historical studies, or information retrieval, although these uses are not excluded a priori (in fact, many of the features required for these applications may be the same as those needed for language engineering).

The encoding format should be standardised and homogeneous, for the sake of both interchange and open-ended retrieval tasks [1]. Treating a restricted domain enables development of a tighter standard than that of the TEI, by providing specific encoding solutions rather than general or multiple ones and, most importantly, by providing standards for elements particularly important in that domain (e.g., linguistic annotation).
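As an illustration of the corpus operations named above (frequency lists, concordance generation, and mutual information scores), the following minimal Python sketch may be helpful. It is not the tooling used by the authors: the function names (tokenize, concordance, bigram_pmi), the whitespace tokenizer, and the three-sentence sample text are all assumptions made for illustration, and a real corpus would need proper tokenization and far more data for the scores to be meaningful.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation (a crude assumption)."""
    tokens = (w.strip(".,;:!?()[]\"'").lower() for w in text.split())
    return [t for t in tokens if t]


def concordance(tokens, keyword, width=2):
    """Keyword-in-context lines: up to `width` tokens on each side of every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def bigram_pmi(tokens):
    """Pointwise mutual information, log2(P(x,y) / (P(x) * P(y))), for adjacent word pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Invented sample text, for illustration only.
sample = "The corpus is aligned. The corpus is tagged. A tagged corpus is useful."
tokens = tokenize(sample)

freq = Counter(tokens)                        # frequency list
kwic = concordance(tokens, "corpus", width=2)  # concordance lines for one keyword
pmi = bigram_pmi(tokens)                      # association scores for collocation candidates

for left, key, right in kwic:
    print(f"{left:>15} [{key}] {right}")
```

Ranking the entries of pmi by score gives a rough list of collocation candidates. On small samples PMI tends to overrate rare pairs, so a minimum frequency threshold is usually applied first.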
The texts are selected according to explicit criteria determined by the main purpose the corpus is meant to serve (lexicographic tasks; terminological data extraction or acquisition; contrastive or comparative studies on parallel texts; grammar induction; machine translation; etc.). Based on several linguistic classification criteria, Sinclair defines a corpus typology in [1] (see also [2] in this volume). Of the types in that typology, we are concerned here with special and parallel corpora.

A special corpus is composed of homogeneous texts that are not necessarily representative of a language; that is, the corpus is not expected to provide evidence covering all the registers of the language in question. Of course, since these texts belong to a given language, they display a set of grammatical and lexical features proper to that language. On the other hand, the texts in special corpora, as opposed to those in reference corpora [1], also exhibit features (mainly at the level of the lexicon) that may be idiosyncratic with respect to the language (rare words, specific collocation patterns, etc.).

Whereas a reference corpus is required to be as large as possible and based on complete and homogeneous texts, covering all the linguistic registers and dialectal variations, a special corpus may be smaller. It may contain more specialized (and even fragmentary) texts, for instance texts belonging to a single author or period, as well as texts obtained under artificial or experimental conditions (e.g., those resulting from investigations).

The second corpus type relevant to the work reported here is the parallel corpus. It is a collection of original texts, each of which is translated into at least one other language. The simplest case involves only two languages, one being a translation of the other. However, parallel corpora may contain many languages. A parallel corpus is inherently a special corpus: even if the source texts, taken in isolation, constituted a reference corpus, their translations would generally not preserve this representativeness with respect to the target language. Thanks to the alignment between the original and its translation(s), parallel corpora offer insight into the nature of translation. They may also help in devising more appropriate translation tools; for instance, translation systems that employ probabilistic methods might increasingly be used with such corpora.

A corpus may serve a large number of applications. In [3] we made a plea for computational corpora and discussed some of the methodological benefits (and direct applications) of corpus linguistics as contrasted with the classical approach of introspective theoretical linguistics.
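To make the use of aligned parallel texts more concrete, the sketch below scores candidate translation equivalents from sentence-aligned pairs using the Dice coefficient over sentence-level co-occurrence. This is a standard association measure chosen here for illustration, not the method of the authors; the function names (dice_scores, best_equivalent) are hypothetical, and the four English-Romanian sentence pairs are invented toy data written without diacritics.

```python
from collections import Counter
from itertools import product

# Invented, sentence-aligned toy data (source sentence, target sentence).
aligned_pairs = [
    ("the house is big", "casa este mare"),
    ("the house is old", "casa este veche"),
    ("the car is big", "masina este mare"),
    ("the car is new", "masina este noua"),
]


def dice_scores(pairs):
    """Dice coefficient 2*c(s,t) / (c(s) + c(t)), counting each word once per aligned pair."""
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_types = set(src.split())
        tgt_types = set(tgt.split())
        src_freq.update(src_types)
        tgt_freq.update(tgt_types)
        joint.update(product(src_types, tgt_types))
    return {
        (s, t): 2 * c / (src_freq[s] + tgt_freq[t])
        for (s, t), c in joint.items()
    }


def best_equivalent(scores, source_word):
    """Return the target word with the highest Dice score for `source_word`."""
    candidates = {t: v for (s, t), v in scores.items() if s == source_word}
    return max(candidates, key=candidates.get)


scores = dice_scores(aligned_pairs)
for word in ("house", "car", "big", "old"):
    print(word, "->", best_equivalent(scores, word))
```

On this toy data the content words pair off cleanly (for example, "house" and "casa" co-occur in exactly the same aligned pairs and score 1.0). Function words such as "the" and "is" both score 1.0 against "este", which shows why co-occurrence statistics alone are not sufficient and why real systems add further constraints.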

