MULTEXT-East word-level annotated multilingual corpus: Nineteen Eighty-Four
§name Tomaž Erjavec, IJS
§responsibility TEI encoding
§name Nancy Ide, Vassar
§responsibility English data
§name Dan Tufiş, RACAI
§responsibility Romanian data
§name Heiki-Jaan Kaalep, TU
§responsibility Estonian data
§name Csaba Oravecz, HAS
§responsibility Hungarian data
§name Vladimír Petkevič, ITCL
§responsibility Czech data
§name Ludmila Dimitrova, BAS
§responsibility Bulgarian data
§name Cvetana Krstev, Duško Vitas
§responsibility Serbian data
§name Tomaž Erjavec, IJS
§responsibility Slovene data
§name Katerina Zdravkova
§responsibility Macedonian data
§name Behrang QasemiZadeh
§responsibility Persian data
§name Natalia Kotsyba
§responsibility Polish data
§name Radovan Garabik
§responsibility Slovak data
EU Copernicus Project COP106 "MULTEXT-East"
EU Copernicus Concerted Action "TELRI"
EU Copernicus Project PL96-1142 "Concede"
EU Capacities Project GA 211938 "MondiLex"
§funding body Individual partners' grants and contracts
MULTEXT-East, Version 4
968,354 word tokens
§distributor MULTEXT-East Web site
Available for research purposes upon receipt of agreement. In published work based on this resource please cite the appropriate publication from the home page of the project.

§title Multext-East/Concede: Nineteen Eighty-Four, Multilingual
§funding body EU Copernicus Project PL96-1142 "Concede"
§funding body EU Copernicus Project COP106 "MULTEXT-East"
§funding body Individual partners' grants and contracts.
§edition Version 3
§distributor MULTEXT-East Web site
Available for research purposes upon receipt of signed agreement.

§project description

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. <pointer>

§correction principles

Since the CD-ROM release of the 1984 corpus, many errors in the linguistic annotation of the individual texts have been corrected.

In the process of conversion to TEI, various format errors were detected and corrected.


The markup of the novels has been normalised: a) structure annotation with div and p (attributes xml:id and type); b) segmentation annotation with s (attribute xml:id); c) tokenisation annotation with w and c (attribute type); d) linguistic annotation with the w attributes lemma and ana.
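The normalised markup can be read with a standard XML parser. The following is a minimal sketch, assuming the texts use the TEI P5 namespace; the sample fragment, its identifiers, lemmas and MSD values are invented for illustration and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment: ids, lemmas and MSDs are illustrative only.
SAMPLE = """
<div xmlns="http://www.tei-c.org/ns/1.0" xml:id="Oen.1" type="part">
  <p xml:id="Oen.1.1">
    <s xml:id="Oen.1.1.1">
      <w lemma="it" ana="#Pp3ns">It</w>
      <w lemma="be" ana="#Vmis3s">was</w>
      <w lemma="cold" ana="#Afp">cold</w>
      <c>.</c>
    </s>
  </p>
</div>
"""


def read_sentences(xml_text):
    """Return a list of (sentence id, tokens) pairs from div/p/s/w/c markup."""
    root = ET.fromstring(xml_text.strip())
    sentences = []
    for s in root.iter(f"{{{TEI_NS}}}s"):
        tokens = []
        for tok in s:
            # Strip the namespace part of the tag: "{ns}w" -> "w".
            kind = tok.tag.rpartition("}")[2]
            if kind not in ("w", "c"):
                continue
            ana = tok.get("ana")
            tokens.append({
                "kind": kind,                      # "w" = word, "c" = punctuation
                "form": tok.text,
                "lemma": tok.get("lemma"),         # None for punctuation
                "msd": ana.lstrip("#") if ana else None,
            })
        sentences.append((s.get(XML_ID), tokens))
    return sentences


if __name__ == "__main__":
    for sid, tokens in read_sentences(SAMPLE):
        print(sid, [(t["form"], t["lemma"], t["msd"]) for t in tokens])
```

The function name `read_sentences` and the dictionary keys are choices made for this sketch; only the element and attribute names (div, p, s, w, c, xml:id, type, lemma, ana) come from the description above.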

All the novels use UTF-8 encoding.

quote elements have in general been changed to P.

In some novels (see the individual headers) q markup has been omitted, while in others it is present as quote.


Segmentation into paragraphs follows the printed sources; it is therefore not 1-1 with the English original. Segmentation into sentences was performed automatically and then hand-validated.

Tokenisation into words and punctuation symbols was performed on the basis of the MULTEXT-East lexica, mostly with the MULTEXT tool 'mtseg', and then hand-validated.


No end-of-line hyphenation is present in the texts.


The linguistic interpretation of the text consists of marking up the word tokens with their context-disambiguated lemma and MULTEXT-East morphosyntactic description. The texts have undergone differing amounts of validation, so their error rates differ.

§standard values

The two-letter language codes follow ISO 639.

The MULTEXT-East morphosyntactic descriptions (MSDs) follow the revised common tables of lexical specifications MULTEXT-East/MondiLex. The lexical MSDs have been converted to an fslib, a feature-structure library, while their decomposition into features is given in an flib, a feature library. The words in the texts have their MSD encoded as the value of the ana (#IDREF) attribute. This attribute refers to an fs, which, in turn, refers via its #IDREFS attribute feats to the f elements that define it.
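The reference chain from a word to its features (ana to fs, then feats to f) can be followed with a simple lookup by xml:id. The sketch below makes several assumptions that are not stated in this header: the TEI P5 namespace, f elements carrying a name attribute and a symbol child with a value attribute, and a simplified sample in which f, fs and w sit in one document. All identifiers and values in the sample are invented for illustration.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, simplified sample: in the corpus the libraries and the
# texts may live in separate files and containers.
SAMPLE = """
<text xmlns="http://www.tei-c.org/ns/1.0">
  <f xml:id="N0." name="CATEGORY"><symbol value="Noun"/></f>
  <f xml:id="N1.c" name="Type"><symbol value="common"/></f>
  <f xml:id="N3.s" name="Number"><symbol value="singular"/></f>
  <fs xml:id="Nc-s" feats="#N0. #N1.c #N3.s"/>
  <w lemma="clock" ana="#Nc-s">clock</w>
</text>
"""


def build_index(root):
    """Map every xml:id in the tree to its element."""
    return {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}


def split_refs(value):
    """Turn '#a #b' into ['a', 'b'] (whitespace-separated local references)."""
    return [ref.lstrip("#") for ref in value.split()]


def decode_ana(word, index):
    """Follow w/@ana -> fs/@feats -> f and return {feature name: value}."""
    fs = index[split_refs(word.get("ana"))[0]]
    features = {}
    for f_id in split_refs(fs.get("feats")):
        f = index[f_id]
        symbol = f.find(f"{TEI}symbol")  # assumed value encoding
        features[f.get("name")] = symbol.get("value")
    return features


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE.strip())
    index = build_index(root)
    for w in root.iter(f"{TEI}w"):
        print(w.text, w.get("lemma"), decode_ana(w, index))
```

The lookup is deliberately container-agnostic: it indexes elements by xml:id only, so it does not depend on how the fslib and flib are wrapped.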

ident = bg
ident = cs
ident = en
ident = et
ident = hr
ident = hu
ident = mk
ident = pl
ident = ro
ident = ru
ident = sh
ident = sk
ident = sl
ident = sl-rozaj
Resian (dialect of Slovene)
ident = sr
ident = uk
§change 2010-05-09<date>Tomaž Erjavec<name>Conversion to MULTEXT-East V4 / TEI P5.
§change 2004-05-10<date>Tomaž Erjavec<name>From BETA to FINAL V3
§change 2004-04-09<date>Tomaž Erjavec<name>Added Serbian
§change 2004-02-27<date>Tomaž Erjavec<name>Harmonised with TELRI/cesDoc corpus.
§change 2003-02-11<date>Tomaž Erjavec<name>Conversion to TEI P4 XML
§change 2001-03-19<date>Tomaž Erjavec<name>Modifications to teiHeaders; new MSD library
§change 2000-10-30<date>Tomaž Erjavec<name>Conversion to TEI, initial teiHeader