TEI Header

§file description
§title statement
id = mte-oana.title
MULTEXT-East word-level annotated multilingual corpus: Nineteen Eighty-Four
§statement of responsibility
§name Tomaž Erjavec, IJS
§responsibility TEI encoding
§statement of responsibility
§name Nancy Ide, Vassar
§responsibility English data
§statement of responsibility
§name Dan Tufiş, RACAI
§responsibility Romanian data
§statement of responsibility
§name Heiki-Jaan Kaalep, TU
§responsibility Estonian data
§statement of responsibility
§name Csaba Oravecz, HAS
§responsibility Hungarian data
§statement of responsibility
§name Vladimír Petkevič, ITCL
§responsibility Czech data
§statement of responsibility
§name Ludmila Dimitrova, BAS
§responsibility Bulgarian data
§statement of responsibility
§name Cvetana Krstev, Duško Vitas
§responsibility Serbian data
§statement of responsibility
§name Tomaž Erjavec, IJS
§responsibility Slovene data
§statement of responsibility
§name Katerina Zdravkova
§responsibility Macedonian data
§statement of responsibility
§name Behrang QasemiZadeh
§responsibility Persian data
§statement of responsibility
§name Natalia Kotsyba
§responsibility Polish data
§statement of responsibility
§name Radovan Garabik
§responsibility Slovak data
§funding body EU Copernicus Project COP106 "MULTEXT-East"
§funding body EU Copernicus Concerted Action "TELRI"
§funding body EU Copernicus Project PL96-1142 "Concede"
§funding body EU Capacities Project GA 211938 "MondiLex"
§funding body Individual partners' grants and contracts
§edition statement
§edition MULTEXT-East, Version 4
type = words
968,354 word tokens
§publication statement
§distributor MULTEXT-East Web site
§address http://nl.ijs.si/ME/V4/
§distributor Individual partners, cf. component headers

Available for research purposes upon receipt of agreement. In published work based on this resource, please cite the appropriate publication from the home page of the project.

§source description
§fully-structured bibliographic citation
§title statement
§title Multext-East/Concede: Nineteen Eighty-Four, Multilingual
§funding body EU Copernicus Project PL96-1142 "Concede"
§funding body EU Copernicus Project COP106 "MULTEXT-East"
§funding body Individual partners' grants and contracts.
§edition statement
§edition Version 3
§publication statement
§distributor MULTEXT-East Web site
§address http://nl.ijs.si/ME/V3/

Available for research purposes upon receipt of signed agreement.

when = 2004-05-10
§source description
§fully-structured bibliographic citation
§title statement
§title Multext-East cesAna: Nineteen Eighty-Four
§funding body EU Copernicus Project COP106 "MULTEXT-East"
§funding body EU Copernicus Action "TELRI"
§edition statement
§edition MULTEXT-East Final Release
§publication statement
§distributor TRACTOR: TELRI Research Archive of Computational Tools and Resources
§publication place "East meets West" CD-ROM, ISBN 3-922641-46-6
§distributor MULTEXT-East Web site
§address http://nl.ijs.si/ME/CD/
when = 1998-01-01
January 1st, 1998
§source description
§citation list
§structured bibliographic citation
§monographic level
§title 1984
§author George Orwell
§date 1949; reprinted 1961
§publisher New American Library
§publication place New York
§encoding description
§project description

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. <pointer>

§editorial practice declaration
§correction principles

Since the CD-ROM release of the 1984 texts, many errors in the linguistic annotation of the individual texts have been corrected.

In the process of conversion to TEI, various format errors were detected and corrected.


The novels have their markup normalised: a) structure annotation with div and p (attributes xml:id and type); b) segmentation annotation with s (attribute xml:id); c) tokenisation annotation with w and c (attribute type); d) linguistic annotation with the w attributes lemma and ana.
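
As a minimal sketch of how the normalised markup described above could be read, the following Python fragment walks a small sample with the standard library. The sample sentence, its xml:id values, lemmas, and MSD codes are invented for illustration, the helper name read_sentences is hypothetical, and the TEI namespace is omitted for brevity; the actual corpus files may differ in these details.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample using only the elements and attributes named in the header.
SAMPLE = """
<div xml:id="ex.1" type="part">
  <p xml:id="ex.1.1">
    <s xml:id="ex.1.1.1">
      <w lemma="it" ana="#Pp3ns">It</w>
      <w lemma="be" ana="#Vmis3s">was</w>
      <w lemma="cold" ana="#Afp">cold</w>
      <c>.</c>
    </s>
  </p>
</div>
"""


def read_sentences(xml_text):
    """Return a list of (sentence id, tokens); each token is (form, lemma, ana)."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        tokens = []
        for tok in s:
            if tok.tag == "w":
                # Word token: carries the lemma and a pointer to its MSD.
                tokens.append((tok.text, tok.get("lemma"), tok.get("ana")))
            elif tok.tag == "c":
                # Punctuation token: no linguistic annotation.
                tokens.append((tok.text, None, None))
        sentences.append((s.get(XML_ID), tokens))
    return sentences


if __name__ == "__main__":
    for sent_id, tokens in read_sentences(SAMPLE):
        print(sent_id, tokens)
```

Keeping w and c tokens in one ordered list preserves the original token sequence, so the running text can be rebuilt from the forms alone.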

All the novels use UTF-8 encoding.

form = unknown

Quote elements have, in general, been changed to p.

The q markup has been omitted in some novels (see the individual headers), while in others it is present as quote.


Segmentation into paragraphs follows the printed sources; it is therefore not 1-1 with the English original. Segmentation into sentences was performed automatically and then hand-validated.

Tokenisation into words and punctuation symbols was performed on the basis of the MULTEXT-East lexica, mostly with the MULTEXT tool 'mtseg', and then hand-validated.


No end-of-line hyphenation is present in the texts.


The linguistic interpretation of the text consists of marking up the word tokens with their context-disambiguated lemma and MULTEXT-East morphosyntactic description. The texts have undergone differing amounts of validation, so error rates differ between them.

§standard values

The two-letter language codes follow ISO 639.

The MULTEXT-East morphosyntactic descriptions (MSDs) follow the revised common tables of lexical specifications MULTEXT-East/Mondilex. The lexical MSDs have been converted to an fslib, a feature-structure library, while their decomposition into features is given in an flib, a feature library. The words in the texts have their MSD encoded as the value of the ana (#IDREF) attribute. This attribute refers to an fs, which, in turn, refers via its #IDREFS feats to the f elements that define it.
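
The chain of references described above (ana to fs, then feats to f) can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's own tooling: the container names fslib and flib are taken from the description above, while the identifiers, feature names, the symbol child used to hold each value, and the helper names build_indexes and resolve_msd are all hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented miniature libraries; real identifiers and feature names may differ.
LIBRARY = """
<lib>
  <flib>
    <f xml:id="f.cat.N" name="Category"><symbol value="Noun"/></f>
    <f xml:id="f.type.c" name="Type"><symbol value="common"/></f>
    <f xml:id="f.gen.m" name="Gender"><symbol value="masculine"/></f>
    <f xml:id="f.num.s" name="Number"><symbol value="singular"/></f>
  </flib>
  <fslib>
    <fs xml:id="Ncms" feats="#f.cat.N #f.type.c #f.gen.m #f.num.s"/>
  </fslib>
</lib>
"""


def build_indexes(root):
    """Index f elements by id as (name, value) and fs elements by id as lists of f references."""
    feats = {
        f.get(XML_ID): (f.get("name"), f.find("symbol").get("value"))
        for f in root.iter("f")
    }
    structs = {
        fs.get(XML_ID): fs.get("feats", "").split()
        for fs in root.iter("fs")
    }
    return feats, structs


def resolve_msd(ana, feats, structs):
    """Follow ana -> fs -> f and return the MSD as a feature dictionary."""
    fs_id = ana.lstrip("#")
    return dict(feats[ref.lstrip("#")] for ref in structs[fs_id])


if __name__ == "__main__":
    root = ET.fromstring(LIBRARY)
    feats, structs = build_indexes(root)
    print(resolve_msd("#Ncms", feats, structs))
```

Building both indexes once keeps each lookup constant-time, which matters when resolving the MSD of every word token in a corpus of this size.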

§text-profile description
§language usage
ident = bg
ident = cs
ident = en
ident = et
ident = hr
ident = hu
ident = mk
ident = pl
ident = ro
ident = ru
ident = sh
ident = sk
ident = sl
ident = sl-rozaj
Resian (dialect of Slovene)
ident = sr
ident = uk
§revision description
§change 2010-05-09, Tomaž Erjavec: Conversion to MULTEXT-East V4 / TEI P5.
§change 2004-05-10, Tomaž Erjavec: From BETA to FINAL V3
§change 2004-04-09, Tomaž Erjavec: Added Serbian
§change 2004-02-27, Tomaž Erjavec: Harmonised with TELRI/cesDoc corpus.
§change 2003-02-11, Tomaž Erjavec: Conversion to TEI P4 XML
§change 2001-03-19, Tomaž Erjavec: Modifications to teiHeaders; new MSD library
§change 2000-10-30, Tomaž Erjavec: Conversion to TEI, initial teiHeader