
MULTEXT-East Language Resources Version 3
Tomaž Erjavec
2004-05-13

1. Introduction

This document describes the third edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes annotated parallel, comparable and speech corpora with morphosyntactic lexica and specifications. The most important component is the linguistically annotated corpus consisting of Orwell's novel ‘1984’ in the English original and translations.

This release builds on the results of several EU projects, MULTEXT-East (produced linked resources for Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian and English), TELRI (added resources for Lithuanian, Croatian, Serbian, and Russian; first release), and CONCEDE (validation, re-encoding; partial re-release), and on the work of the TEI Task Force on SGML to XML migration (conversion to XML).

Version 3 of MULTEXT-East resources brings together the first two releases (TELRI and CONCEDE), makes them available in TEI P4 XML, and introduces further extensions, e.g. the annotated ‘1984’ in Serbian and the specification for Resian, a dialect of Slovene. This dataset, unique in terms of languages and the wealth of encoding, is extensively documented, and freely available for research purposes, according to the MULTEXT-East research licence; a local text copy is provided for reference.

The published paper describing this version of the resources is:
Tomaž Erjavec: MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. In the Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC'04 (ELRA). Paris 2004. [PDF]

Please acknowledge the use of resources in publications by citing the above paper, or, if more relevant, others from the MULTEXT-East annotated bibliography.

The rest of this document is structured as follows. The next subsection gives a synopsis of the languages for which this distribution provides resources. Section 2 ‘MULTEXT-East resources’ describes the resources on offer in detail, with subsections on the MULTEXT-East morphosyntactic resources, the MULTEXT-East cesDoc corpus, and the MULTEXT-East ‘1984’ corpus; it closes with Section 2.4 ‘Distribution’, which discusses the availability of the resources. Section 3 ‘Bibliography and related research’ essentially points to the annotated bibliography, and Section 4 ‘Contributors’ lists the contact points for particular languages and acknowledges those who were involved in producing the resources.

1.1. The languages of MULTEXT-East

Below we list the languages represented in the resources, with the link to their respective Ethnologue ISO 639 entries.

2. MULTEXT-East resources

This section gives the details of the MULTEXT-East resources Version 3: Section ‘MULTEXT-East morphosyntactic resources’ details the three linked word-level syntactic resources: the specification for morphosyntactic descriptions, the morphosyntactic lexica, and the word-level annotated corpus (‘1984’); Section ‘MULTEXT-East cesDoc corpus’ introduces the structurally marked up corpus, consisting of a parallel part (again ‘1984’), two comparable parts (fiction, newspapers), and a small speech corpus; Section ‘MULTEXT-East ‘1984’ corpus’ revisits this corpus and recaps the relevant data in somewhat more detail.

2.1. MULTEXT-East morphosyntactic resources

By far the most useful part of the MULTEXT-East project deliverables proved to be the morphosyntactic resources, which were later re-released in the CONCEDE edition and consist of three layers:

  1. The morphosyntactic specifications, which set out the grammar and vocabulary of valid morphosyntactic descriptions, MSDs. The specifications determine what, for each language, is a valid MSD and what it means, e.g.,
    Ncnp = PoS:Noun, Type:common, Gender:neuter, Number:plural
  2. The morphosyntactic lexicons, which contain the full inflectional paradigms of at least 15,000 lemmas and cover the ‘1984’ corpus. Each entry gives the word-form, its lemma and MSD, e.g.,
    clocks clock Ncnp
  3. The morphosyntactically annotated ‘1984’ corpus, where each word is assigned its context disambiguated MSD and lemma, e.g.,
    <w lemma="clock" ana="Ncnp">clocks</w>

In the next sections we detail each of these layers in turn.

2.1.1. Morphosyntactic specifications

Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian, Croatian, Resian.

The syntax and semantics of the morphosyntactic descriptions (MSDs) are given in the MULTEXT-East morphosyntactic specifications, which have been developed in the formalism and on the basis of specifications for six Western European languages of the MULTEXT project and in cooperation with EAGLES, the Expert Advisory Group on Language Engineering Standards.

Originally, these specifications were released as a report of the MULTEXT-East project, but they have been revised for both subsequent releases. The complete specifications are structured as a report, and contain introductory chapters, followed by the list of defined categories (parts-of-speech), and then, for each category, a table of attribute-values, and the languages the features are appropriate for. These so-called common tables are followed by language-particular sections. Each language section is further subdivided, and can contain feature co-occurrence restrictions, examples, notes, and full lists of valid MSDs, as well as localisation information. The formal core of the specifications resides in the common tables, as they define the features, their codes for MSD representation, and their appropriateness for each language.
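
To illustrate how the common tables give meaning to an MSD, the following is a minimal Python sketch that expands the example Ncnp into its features. The table below is a hypothetical excerpt: it contains only the codes needed for that one example, and the names NOUN_TABLE, CATEGORIES and decompose_msd are ours, not part of the distribution. The treatment of a hyphen as "attribute not applicable" is likewise an assumption of this sketch.

```python
# Hypothetical excerpt of a common table for nouns. Only the codes needed for
# the Ncnp example in this document are listed; the real tables are larger.
NOUN_TABLE = [
    ("Type", {"c": "common"}),
    ("Gender", {"n": "neuter"}),
    ("Number", {"p": "plural"}),
]

# Maps the first character of an MSD to a category name and its table.
CATEGORIES = {"N": ("Noun", NOUN_TABLE)}


def decompose_msd(msd):
    """Expand a positional MSD string into a list of (attribute, value) pairs."""
    if not msd:
        raise ValueError("empty MSD")
    category_code, value_codes = msd[0], msd[1:]
    if category_code not in CATEGORIES:
        raise ValueError(f"unknown category code: {category_code!r}")
    category_name, table = CATEGORIES[category_code]
    if len(value_codes) > len(table):
        raise ValueError(f"too many positions in MSD: {msd!r}")

    features = [("PoS", category_name)]
    for (attribute, codes), code in zip(table, value_codes):
        if code == "-":
            # Assumption of this sketch: a hyphen means "not applicable".
            continue
        if code not in codes:
            raise ValueError(f"invalid code {code!r} for attribute {attribute}")
        features.append((attribute, codes[code]))
    return features


for attribute, value in decompose_msd("Ncnp"):
    print(f"{attribute}:{value}")
```

Because every position in the string is looked up in the table of its category, an MSD that is not licensed by the table is rejected rather than silently accepted.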

Technically, the complete specifications are a LaTeX document (with derived PDF and HTML renderings), where the common tables are plain ASCII in a strictly defined format. This format is suitable for viewing, and reasonably manageable for modification and for the addition of new languages. However, it is not appropriate for processing, in particular for enabling smooth manipulation and linking to an XML encoded corpus using the MSDs. We have therefore implemented a (Perl) conversion of the common tables into XML, to the TEI.fs module, a tagset devoted to encoding feature-structures. This tagset is currently being used as the basis of an evolving ISO standard (currently a Draft International Standard), as part of the work of ISO/TC 37/SC4 Language Resource Management.

In the distribution, the specifications are given in the original and several derived encodings; furthermore, closely related reports of MULTEXT and EAGLES are provided:

msd/msd.pdf and msd/msd2up.pdf
The MULTEXT-East specifications in PDF, in a normal and a tree-saving version.
msd/html/
The specifications as HTML
msd/tei/
The specifications as XML / TEI P4 fs libraries
msd/tex/ and msd/bin/
The specifications in the source LaTeX and programs to make and convert the specifications
msd/related/
Copies of or links to reports that served as the basis for the MULTEXT-East morphosyntactic specifications:
  • Nuria Bel, Nicoletta Calzolari and Monica Monachini: MULTEXT Deliverable D1.6.1B Common Specifications and Notation for Lexicon Encoding and Preliminary Proposal for the Tagsets. 1995, ILC, Pisa. [ HTML, PDF, PDF2up, LaTeX ]
  • Simone Teufel and Christine Stockert: EAGLES specification for German morphosyntax. 1996. [ PDF, PS.gz ]
  • Geoffrey Leech and Andrew Wilson: EAGLES Report EAG-TCWG-MAC/R Recommendations for the Morphosyntactic Annotation of Corpora. 1996, ILC, Pisa.
  • Nicoletta Calzolari and Monica Monachini (eds.): EAGLES Report EAG-CLWG-MORPHSYN/R Synopsis and Comparison of Morphosyntactic Phenomena Encoded in Lexicons and Corpora: A Common Proposal and Applications to European Languages. 1996, ILC, Pisa.

2.1.2. Lexicons

Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian.

The MULTEXT-East morphosyntactic lexicons have a simple structure, where each lexical entry is composed of three fields:

  • the word-form, which is the inflected form of the word, as it appears in the text, modulo sentence-initial capitalisation;
  • the lemma, which is the base-form of the word; where the entry is itself the base-form, the lemma is given as the equal sign;
  • the MSD, i.e., the morphosyntactic description.
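
The three-field structure can be read with a few lines of Python. This is only a sketch: it assumes that the fields are separated by whitespace and that a word-form never contains whitespace itself, neither of which is stated above. The names LexEntry and parse_lexicon_line are ours, and the second sample line is a made-up entry used only to show the equal-sign convention.

```python
from typing import NamedTuple


class LexEntry(NamedTuple):
    word_form: str
    lemma: str
    msd: str


def parse_lexicon_line(line):
    """Parse one lexicon line into a LexEntry, expanding '=' to the word-form."""
    fields = line.split()  # assumption: whitespace-separated fields
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    word_form, lemma, msd = fields
    if lemma == "=":
        # The entry is itself the base-form, so the lemma equals the word-form.
        lemma = word_form
    return LexEntry(word_form, lemma, msd)


sample_lines = [
    "clocks clock Ncnp",  # entry quoted in this document
    "clock = Ncns",       # made-up entry illustrating the '=' convention
]
for line in sample_lines:
    print(parse_lexicon_line(line))
```

Expanding the equal sign at read time keeps later processing simple, since every entry then carries an explicit lemma.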

To produce the lexica, the token lists of the MULTEXT-East corpus were first fed through morphological analysers in order to produce the lemma list; this list was further extended from the comparable corpus to reach at least 15,000 lemmas. Some languages have extended this further, e.g., Romanian to 41,000 lemmas. In the next step, the lemmas were fed back to morphological generators (except for the agglutinative languages) in order to produce the complete inflected lists, i.e., the full paradigms of the lemmas, which constituted the final lexica of the project.

The MULTEXT-East lexica serve as medium sized morphological lexica for the languages. In addition to explicating the inflectional behaviour of the most common (and, typically, morphologically the most complex) words of the languages, the lexica also serve to establish the definitive set of valid MSDs for the languages.

To serve as a standard registry of MSDs, we converted the lexical MSDs to TEI feature structure libraries, fsLib, one for each category. Here each MSD is expressed as a feature structure specifying its id, the language(s) it is appropriate for, and its decomposition into features.

The structure and contents of the lexica are explained in MULTEXT-East report MTE D1.2 M: Language-specific Resources (but note that the lexica have in the meantime been revised, so the details in this report are no longer correct).

The lexica themselves are available in the directory lex/, which also contains a README file and possibly a file giving various counts on the lexica (note that WWW access to this directory is restricted).

2.1.3. Linguistically annotated 1984

Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian.

This is the centrepiece of the resources, as it contains word-level markup, namely context-disambiguated lemmas and MSDs, e.g., <w lemma="it" ana="Pp3ns">it</w>. This so-called cesAna corpus is suitable for PoS tagging experiments. Because it was the first such resource for many of the languages, it was also the most difficult to produce, as the work had to proceed mostly manually.

The corpus is available in the directory crp/, where the driver file is mte-cesana.xml (note that WWW access to this directory is restricted).
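
As a rough illustration of how the word-level markup can be consumed, the sketch below extracts (word-form, lemma, MSD) triples from <w> elements with the Python standard library. The fragment in SAMPLE is constructed for this example from the two <w> elements quoted in this document; the enclosing <s> element and its id value are assumptions, as is the function name read_tokens. The real corpus files may rely on a DTD and entity declarations that this simple parser does not resolve.

```python
import xml.etree.ElementTree as ET

# Constructed fragment: the two <w> elements are quoted from this document,
# the enclosing <s> element and its id are assumptions of this sketch.
SAMPLE = (
    '<s id="demo">'
    '<w lemma="it" ana="Pp3ns">it</w> '
    '<w lemma="clock" ana="Ncnp">clocks</w>'
    '</s>'
)


def read_tokens(xml_text):
    """Return (word_form, lemma, msd) for every <w> element in the fragment."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("lemma"), w.get("ana")) for w in root.iter("w")]


for word_form, lemma, msd in read_tokens(SAMPLE):
    print(word_form, lemma, msd, sep="\t")
```

The resulting triples have the same three fields as the lexicon entries, which makes it straightforward to compare corpus annotations against the lexica.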

2.2. MULTEXT-East cesDoc corpus

The extended MULTEXT-East resources were first released in the scope of the TELRI concerted action, first on a CD-ROM and later through TRACTOR, the TELRI Research Archive of Computational Tools and Resources.

For Version 3, the structurally annotated texts from this edition (the so-called cesDoc corpus, as this was the format it was originally encoded in) were converted from SGML to XML. The corpus is stored as one TEI P4 document with the <teiCorpus.2> root element, comprising the header and the component texts, i.e. the multilingual parallel speech, comparable fiction and news, and parallel ‘1984’ texts. The corpus is further documented in its corpus and text headers, and also in the original MULTEXT-East report D2.1 F: Corpus Collection and Preparation, although the information there is no longer current in all respects.

The corpus is available in the directory crp/, where the driver file is mte-cesdoc.xml. The complete corpus texts are also available in HTML, in the directory htm/ (driver file), produced with Sebastian Rahtz's TEI stylesheets (note that WWW access to these directories is restricted).

2.2.1. Quick HTML view

The table below links to the HTML view of the MULTEXT-East cesDoc corpus, namely to the teiHeaders and the texts.

Table 1. MULTEXT-East cesDoc corpus, HTML view

Language     "1984"        Speech        Fiction       News
English      Header Text   Header Text   -             -
Bulgarian    Header Text   Header Text   Header Text   Header Text
Czech        Header Text   Header Text   Header Text   Header Text
Estonian     Header Text   Header Text   Header Text   Header Text
Hungarian    Header Text   Header Text   Header Text   Header Text
Romanian     Header Text   Header Text   Header Text   Header Text
Slovene      Header Text   Header Text   Header Text   Header Text
Lithuanian   Header Text   -             -             -
Serbian      Header Text   -             -             -
Russian      Header Text   -             -             -

2.2.2. Speech corpus components

Languages: Romanian, Slovene, Estonian, Hungarian, (English, Czech, Bulgarian)

MULTEXT-East produced a small corpus of spoken texts taken from the EUROM-1 speech corpus. It comprises the translations (from English) of forty short passages of five thematically connected sentences. For four languages, the texts have also been read, recorded and included in the distribution. The corpus texts contain links to the spoken passages, which have for V3 been normalised in terms of volume, and stored as .wav files. The speech files are, due to their size, stored and distributed in a separate bundle.

2.2.3. Comparable corpus components

Languages: Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian.

The multilingual comparable corpus contains a fiction part and a news part, where the data is comparable across the languages in terms of the number and size of texts; each of the 12 parts has approx. 100,000 words. The corpus is structurally marked up with over 40 different elements; however, sub-paragraph markup has not been harmonised across the languages.

2.2.4. Structural 1984 and alignments

Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, (Latvian), Lithuanian, Serbian, Russian.

The multilingual parallel corpus consists of the novel ‘1984’, about 100,000 words in length. The corpus contains extensive headers and markup for document structure, sentences, and various sub-sentence annotations; the markup is similar to that of the comparable corpus, but better harmonised across the languages.

The translations of ‘1984’ have been automatically sentence aligned with the English original, and the alignments hand-validated. The bilingual alignments are valid to xcesAlign.dtd, i.e., are stored not with the primary data but in separate documents, as references to sentence IDs, e.g., <link xtargets="Osl.1.2.6.6 ; Ocs.1.1.5.6 Ocs.1.1.5.7"/>.
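
A link of this kind can be unpacked as follows. This minimal Python sketch assumes, based only on the example above, that the two sides of an alignment are separated by a semicolon and that the sentence IDs on each side are separated by whitespace; the function name parse_link is ours.

```python
import xml.etree.ElementTree as ET


def parse_link(link_xml):
    """Split the xtargets attribute of a <link> into one ID list per side."""
    link = ET.fromstring(link_xml)
    xtargets = link.get("xtargets")
    if xtargets is None:
        raise ValueError("link element has no xtargets attribute")
    # Assumption: ';' separates the sides, whitespace separates the IDs.
    return [side.split() for side in xtargets.split(";")]


example = '<link xtargets="Osl.1.2.6.6 ; Ocs.1.1.5.6 Ocs.1.1.5.7"/>'
left, right = parse_link(example)
print(left)   # sentence IDs on the first side of the alignment
print(right)  # sentence IDs on the second side of the alignment
```

In the example link, one sentence on the first side is aligned with two sentences on the second side, so the function returns a one-element and a two-element list.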

The cesDoc encoded novel served as the basis for producing the linguistically annotated version. The link between the two is maintained via sentence identifiers.

2.3. MULTEXT-East 1984 corpus

The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This annotated parallel corpus contains the novel in the English original (about 100,000 words in length) and its translations into a number of languages.

George Orwell
Nineteen Eighty-Four


Big Brother is Watching You!
It was an enormous pyramidal structure of glittering white concrete, soaring up, terrace after terrace, 300 metres into the air. From where Winston stood it was just possible to read, picked out on its white face in elegant lettering, the three slogans of the Party:
War is peace Freedom is slavery Ignorance is strength
Războiul este pace Libertatea este sclavie Ignoranţa este putere
Vojna je mir Svoboda je suženjstvo Nevednost je moč
Válka je mír Svoboda je otroctví Nevědomost je síla
Войната е мир Свободата е робство Невежеството е сила
Sõda on rahu Vabadus on orjus Teadmatus on jõud
Rat je mir Sloboda je ropstvo Neznanje je moć
A háború: béke A szabadság: szolgaság A tudatlanság: erő
Karas — tai taika Laisvė — tai vergija Nežinomas — tai jėga
Rat je mir Sloboda je ropstvo Neznanje je moć
Война — это мир Свобода — это рабство Незнание — сила

2.3.1. Quick HTML view

The table below gives, for each language, the HTML rendering of: the Text of the novel from the cesDoc corpus; the relevant section of the MULTEXT-East D2.1 Report; the TEI cesDoc header of "1984"; and, where available, the TEI cesAna header together with the language-particular section of the MULTEXT-East Morphosyntactic Specification (MSD).

Table 4. MULTEXT-East "1984", HTML view

Language     Text   Report   cesDoc header   cesAna header   MSD
All          Text   Report   cesDoc header   cesAna header   MSD
English      Text   Report   cesDoc header   cesAna header   MSD
Bulgarian    Text   Report   cesDoc header   cesAna header   MSD
Czech        Text   Report   cesDoc header   cesAna header   MSD
Estonian     Text   Report   cesDoc header   cesAna header   MSD
Hungarian    Text   Report   cesDoc header   cesAna header   MSD
Romanian     Text   Report   cesDoc header   cesAna header   MSD
Slovene      Text   Report   cesDoc header   cesAna header   MSD
Latvian      -      Report   -               -               -
Lithuanian   Text   Report   cesDoc header   -               -
Serbian      Text   Report   cesDoc header   cesAna header   MSD
Russian      Text   -        cesDoc header   -               -

2.3.2. Description

The novel is encoded in TEI P4 and exists in two versions:

  • as part of the cesDoc corpus (originally from the TELRI release), where it is structurally marked up and harmonised down to paragraph level; various sub-paragraph markup is also included; this version is also available in HTML, via the TEI XSLT stylesheet.
  • as the cesAna corpus (from the CONCEDE release), where the texts have been structurally normalised and tokenised, and each word tagged with hand-validated context-disambiguated lemma and morphosyntactic description, so it can be used e.g. in PoS tagging and morphological analysis experiments.

Apart from the marked-up texts themselves, the corpus also has two other components:

  • validated sentence alignments (mostly two-way with the English original), which are stored in accordance with the CES conventions for parallel text alignment, i.e. in separate documents containing links to <s>entence elements of the texts.
  • the formal core of the MULTEXT-East morphosyntactic specification, which comprises the TEI header, feature-structure and feature libraries. The former contain the full set of valid morphosyntactic descriptions, and the latter their decomposition into features; this specification also constitutes a part of the cesAna corpus.

The first edition of the "1984" corpus is detailed in the MULTEXT-East project reports MTE D2.1 F "‘1984’ Corpus Collection" (with Appendix) and MTE D2.3 F "Sentence Alignment" and "Morphosyntactic Tagging". However, note that the encoding has substantially changed from the first MULTEXT-East version described in the preceding reports. The details on the encoding for Version 3 are to be found in the TEI headers in the two corpora, i.e. the cesDoc teiHeader and cesAna teiHeader. The details on the morphosyntactic descriptions used to word-tag V3 of the corpus are given in the MULTEXT-East Morphosyntactic Specifications, and also in the MSD library teiHeader, which is a part of the cesAna corpus.

2.4. Distribution

The resources are mounted on the Web, on the MULTEXT-East Version 3 Web site. The documentation and the morphosyntactic specification are freely available, while the corpora and lexica are restricted to research use only. To get access to these resources, the Web based MULTEXT-East research licence should be filled out and submitted. The password is then sent by email.

Registered users can browse the full resources on-line, or they can download them; they are distributed as gzipped tar files (.tgz). Due to the size of the resources and the fact that different users are likely to use only parts, they are available not only as a complete download, but also split by resource type:

mteV3(date).tgz (136M tgz / 295M unpacked)
The complete V3 resources: documentation, MSD specification, corpora, lexica, speech
mteV3(date)-doc.tgz (10M / 27M)
The documentation - those fetching other partial archives are strongly advised to download also the documentation.
mteV3(date)-ana.tgz (14M / 62M)
The morphosyntactic resources, namely the morphosyntactic specifications, lexica, and cesAna corpus.
mteV3(date)-crp.tgz (7M / 37M)
The cesDoc corpus.
mteV3(date)-spch.tgz (100M / 121M)
The speech corpus.

After downloading, unpack the archives. For example, suppose the distribution was made on 2004-05-05 and you want everything except speech. You would then download the -doc, -crp, and -ana archives and, on a Unix machine, run:

$ tar xfz mteV3-2004-05-05-doc.tgz
$ tar xfz mteV3-2004-05-05-crp.tgz
$ tar xfz mteV3-2004-05-05-ana.tgz

All three archives unpack into the directory mteV3-2004-05-05/.

At some point we plan to distribute the MULTEXT-East V3 resources also on CD-ROM. If this would be of interest, please get in touch.

3. Bibliography and related research

An annotated bibliography is provided in a separate folder, bib/, in HTML and PDF. Some papers are also mirrored there, in particular the one describing Version 3 of the MULTEXT-East resources:

Tomaž Erjavec: MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. In the Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC'04 (ELRA). Paris 2004

4. Contributors

This section lists the contributors to the MULTEXT-East resources (Version 3), starting with the partners of the MULTEXT-East project, followed by the partners of the TELRI concerted action that contributed resources to MULTEXT-East, additional contributors, and, finally, acknowledgements to all the other people who made the production of these resources possible. Further information on who did what can be found in the corpus headers and on the title page of the morphosyntactic specifications. Each partner's responsibility is also listed; for resources of the particular languages, this contact point should also be used for further inquiries.

4.1. MULTEXT-East Partners

Listed below are the partners of the original Copernicus MULTEXT-East project.

4.1.1. Aix-en-Provence

Responsibility:
MULTEXT-East coordinator
People:
Jean Véronis (multext at univ-aix.fr)
Organisation:
Laboratoire Parole et Langage
Centre National de la Recherche Scientifique
Aix-en-Provence, France
Address:
29, Av. Robert Schuman
13621 Aix-en-Provence Cedex 1

4.1.2. Vassar

Responsibility:
Corpus encoding, English data
People:
Nancy Ide (ide at cs.vassar.edu)
Greg Priest-Dorman (priestdo at cs.vassar.edu)
Organisation:
Dept. of Computer Science
Vassar College
Address:
124 Raymond Avenue
Poughkeepsie, NY 12604-0520

4.1.3. Pisa

Responsibility:
Morphosyntactic specification (associate partner)
People:
Nicoletta Calzolari (glottolo at vm.cnuce.cnr.it)
Monica Monachini (corpmon2 at vm.cnuce.cnr.it)
Organisation:
Istituto di Linguistica Computazionale
Consiglio Nazionale delle Ricerche
Pisa, Italy

4.1.4. Sofia

Responsibility:
Bulgarian data
People:
Radoslav Pavlov
Ludmila Dimitrova (ludmila at bgearn.acad.bg)
Lydia Sinapova (lydia at iinf.iinf.bg)
Kiril Simov (kivs at bgcict.acad.bg)
Organisation:
Department of Mathematical Linguistics
Institute of Mathematics and Informatics
Bulgarian Academy of Sciences
Address:
8 Acad. G. Bonchev Street
BG-1113 Sofia
Bulgaria
Telephone:
+359 2 713 3831
Fax:
+359 2 971 36 49

4.1.5. Prague

Responsibility:
Czech data
People:
Vladimír Petkevič (vladimir.petkevic at ff.cuni.cz)
Organisation:
Institute of Theoretical and Computational Linguistics
Faculty of Philosophy
Charles University
Address:
Celetna 13
11000 Prague 1
Czech Republic
Telephone:
+42 2 2481-1870 ext. 252
Fax:
+42 2 2481-2166
Subcontractor:

People:
Zdenek Laciga
Organisation:
BYLL Software, Ltd.
Address:
Anenske nam. 2
110 00 Prague 1
Czech Republic

4.1.6. Tartu

Responsibility:
Estonian data
People:
Heiki-Jaan Kaalep (hkaalep at psych.ut.ee)
Organisation:
Department of General Linguistics
Tartu University
Address:
Ülikooli 18
EE-2400 Tartu
Estonia
Telephone:
+372 74 30 803
Fax:
+372 74 35 440

4.1.7. Budapest

Responsibility:
Hungarian data
People:
Laszlo Tihanyi (Tihanyi at nytud.hu)
Csaba Oravecz (oravecz at nytud.hu)
Organisation:
Research Institute for Linguistics
Hungarian Academy of Sciences
Address:
Szinhaz u. 5-9 - P.O. Box 19
H-1250 Budapest
Hungary
Telephone:
+36 1 1758 011 ext. 155
+36 1 1758 285
Fax:
+36 1 21 22 050
Subcontractor:

People:
Gabor Proszeky
Organisation:
MorphoLogic
Address:
Fo u. 56-58 I/3
H-1011 Budapest
Hungary
Telephone:
+36 1 2018355
Fax:
+36 1 2018355

4.1.8. Bucharest

Responsibility:
Romanian data
People:
Dan Tufiş (tufis at valhalla.racai.ro)
Ştefan Bruda (bruda at magus.racai.ro)
Mihai Ciocoiu (mihaic at magus.racai.ro)
Organisation:
Center for Research in Machine Learning, Natural Language Processing and Conceptual Modelling
Romanian Academy of Sciences
Address:
Casa Academiei "13 Septembrie" 13
Bucharest 71102
Romania
Telephone:
+40 1 410-4113
Fax:
+40 1 411-3916

4.1.9. Ljubljana

Responsibility:
Slovene data, corpus workpackage leader
People:
Tomaž Erjavec (tomaz.erjavec at ijs.si)
Organisation:
Dept. of Knowledge Technologies
Jožef Stefan Institute
Address:
Jamova 39
SI-1000 Ljubljana
Slovenia
Telephone:
+386 61 1773-507
Fax:
+386 61 1258-058
Subcontractor:

People:
Miro Romih (miro.romih at amebis.si)
Peter Holozan (peter.holozan at amebis.si)
Organisation:
AMEBIS d.o.o.
Address:
Jakopičeva 6
61240 Kamnik
Slovenia
Telephone:
+386 61 612-829
Fax/Telephone:
+386 61 811-035

4.2. TELRI Partners

4.2.1. Riga

Responsibility:
Latvian data
People:
Andrejs Spektors (aspekt at mii.lu.lv)
Organisation:
Artificial Intelligence Laboratory
Institute of Mathematics and Computer Science
University of Latvia
Address:
29, Raina bulv.
LV-1459 Riga
Latvia

4.2.2. Kaunas

Responsibility:
Lithuanian data
People:
Andrius Utka (andrius at donelaitis.vdu.lt)
Ruta Marcinkeviciene (ruta.marcinkeviciene at vdu.lt)
Organisation:
Center of Computational Linguistics
Faculty of Humanities
Vytautas Magnus University
Address:
S. Daukanto 28
3000 Kaunas
Lithuania

4.2.3. Belgrade

Responsibility:
Serbian data
People:
Cvetana Krstev (cvetana at matf.bg.ac.yu)
Duško Vitas (vitas at matf.bg.ac.yu)
Organisation:
Faculty of Mathematics
Belgrade University
Address:
Studentski trg 16
11000 Belgrade
Serbia and Montenegro

4.2.4. Zagreb

Responsibility:
Croatian data
People:
Marko Tadić (marko.tadic at ffzg.hr)
Organisation:
Department of Linguistics
Faculty of Philosophy
University of Zagreb
Address:
Ivana Lučića 3
HR-10000 Zagreb
Croatia

4.3. Additional Contributors

4.3.1. Severodonetsk

Responsibility:
Russian data
People:
Paul Sokolovsky (Paul.Sokolovsky at technologist.com)
Sergey Sryvkin
Organisation:
Severodonetsk Institute of Technology
East-Ukraine State University
Address:
Sovetsky st., bl. 3a
Severodonetsk, Lugansk reg.
Ukraine

4.3.2. Padova

Responsibility:
Resian data
People:
Han Steenwijk (han.steenwijk at unipd.it)
Organisation:
Department of Anglo-Germanic and Slavic Linguistics and Literature
Padova University

4.4. Acknowledgments

Apart from the institutions and people listed above, the following people have greatly contributed to the production of the MULTEXT-East language resources: Renata Anžič, Liviu Anca, Ana-Maria Barbu, Aleksandra Bizjak, Damjan Bojadžiev, Lydia Bozhilova, Aleš Dobinikar, Külli Habicht, Daniel Hirst, Milena Hnátková, Primož Jakopin, Riina Mosna, Kadri Muischnek, Mircea Nicolescu, Heili Orav, Vasile Pătraşcu, Leho Paldre, Tsvetan Petrov, Helen Potter, Andriela Rääbis, Georgiana Rotariu, Matjaž Sešek, Tanja Semen, Urve Talvik, Bojana Todorovič, Elias Treeman, Viire Villandi, and Olga Vuković.

The work on MULTEXT-East resources was supported by the European Union project MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages, Copernicus 106, and in part by the US National Science Foundation grant IRI-9413451. The MULTEXT-East project results have been greatly enhanced by the EU Concerted Action TELRI: Trans-European Language Resources Infrastructure. In particular, additional language resources have been produced, and the project's material organised for CD distribution. Work on the second release of the MULTEXT-East resources was supported by EU Copernicus Project PL96-1142 CONCEDE: Consortium for Central European Dictionary Encoding, while the work on the third release was partially funded by grants from the National Endowment for the Humanities, in the scope of the TEI Task Force on SGML to XML migration. The work on the resources has been additionally supported by national funding bodies and individual partners' grants and contracts.



Date: (revised 2004-05-13) Author: Tomaž Erjavec, (revised Tomaž Erjavec).