TEI Header

^§file description

^§title statement

MULTEXT-East word-level annotated multilingual corpus: Nineteen Eighty-Four

^§statement of responsibility

^§name	Tomaž Erjavec, IJS
^§responsibility	TEI encoding

^§statement of responsibility

^§name	Nancy Ide, Vassar
^§responsibility	English data

^§statement of responsibility

^§name	Dan Tufiş, RACAI
^§responsibility	Romanian data

^§statement of responsibility

^§name	Heiki-Jaan Kaalep, TU
^§responsibility	Estonian data

^§statement of responsibility

^§name	Csaba Oravecz, HAS
^§responsibility	Hungarian data

^§statement of responsibility

^§name	Vladimír Petkevič, ITCL
^§responsibility	Czech data

^§statement of responsibility

^§name	Ludmila Dimitrova, BAS
^§responsibility	Bulgarian data

^§statement of responsibility

^§name	Cvetana Krstev, Duško Vitas
^§responsibility	Serbian data

^§statement of responsibility

^§name	Tomaž Erjavec, IJS
^§responsibility	Slovene data

^§statement of responsibility

^§name	Katerina Zdravkova
^§responsibility	Macedonian data

^§statement of responsibility

^§name	Behrang QasemiZadeh
^§responsibility	Persian data

^§statement of responsibility

^§name	Natalia Kotsyba
^§responsibility	Polish data

^§statement of responsibility

^§name	Radovan Garabik
^§responsibility	Slovak data

^§funding body

EU Copernicus Project COP106 "MULTEXT-East"

^§funding body

EU Copernicus Concerted Action "TELRI"

^§funding body

EU Copernicus Project PL96-1142 "Concede"

^§funding body

EU Capacities Project GA 211938 "MondiLex"

^§funding body

Individual partners' grants and contracts

^§edition statement

^§edition

MULTEXT-East, Version 4

^§extent

^§measure
type = words

968,354 word tokens

^§publication statement

^§distributor

MULTEXT-East Web site

^§address

http://nl.ijs.si/ME/V4/

^§distributor

Individual partners, cf. component headers

^§availability

Available for research purposes upon receipt of agreement. In published work based on this resource please cite the appropriate publication from the home page of the project.

^§source description

^§fully-structured bibliographic citation

^§title statement

^§title	Multext-East/Concede: Nineteen Eighty-Four, Multilingual
^§funding body	EU Copernicus Project PL96-1142 "Concede"
^§funding body	EU Copernicus Project COP106 "MULTEXT-East"
^§funding body	Individual partners' grants and contracts.

^§edition statement

^§edition

Version 3

^§publication statement

^§distributor

MULTEXT-East Web site

^§address

http://nl.ijs.si/ME/V3/

^§availability

Available for research purposes upon receipt of signed agreement.

^§date
when = 2004-05-10

2004-05-10

^§source description

^§fully-structured bibliographic citation

^§title statement

^§title	Multext-East cesAna: Nineteen Eighty-Four
^§funding body	EU Copernicus Project COP106 "MULTEXT-East"
^§funding body	EU Copernicus Action "TELRI"

^§edition statement

^§edition

MULTEXT-East Final Release

^§publication statement

^§distributor

TRACTOR: TELRI Research Archive of Computational Tools and Resources

^§publication place

"East meets West" CD-ROM, ISBN 3-922641-46-6

^§distributor

MULTEXT-East Web site

^§address

http://nl.ijs.si/ME/CD/

^§date
when = 1998-01-01

January 1st, 1998

^§source description

^§citation list

^§structured bibliographic citation

^§monographic level

^§title

1984

^§author

George Orwell

^§imprint

^§date	1949; reprinted 1961
^§publisher	New American Library
^§publication place	New York

^§encoding description

^§project description

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. _<pointer>

^§editorial practice declaration

^§correction principles	Since the CD-ROM release of the 1984 texts, many errors in the linguistic annotation of the individual texts have been corrected. During conversion to TEI, various format errors were detected and corrected.
^§normalization	The markup of the novels has been normalised: a) structure annotation with div and p (attributes xml:id and type); b) segmentation annotation with s (attribute xml:id); c) tokenisation annotation with w and c (attribute type); d) linguistic annotation with the w attributes lemma and ana. All the novels use UTF-8 encoding.
^§quotation form = unknown	quote elements have in general been changed to p. The q markup has been omitted in some novels (see the individual headers), while in others it is present as quote.
^§segmentation	Segmentation into paragraphs follows the printed sources; it is therefore not 1-1 with the English original. Segmentation into sentences was performed automatically and then hand-validated. Tokenisation into words and punctuation symbols was performed on the basis of the MULTEXT-East lexica, mostly with the MULTEXT tool 'mtseg', and then hand-validated.
^§hyphenation	No end-of-line hyphenation present in texts.
^§interpretation	The linguistic interpretation of the text consists of marking up the word tokens with their context-disambiguated lemma and MULTEXT-East morphosyntactic description. The texts have undergone differing amounts of validation, so their error rates differ.
^§standard values	The two-letter language codes follow ISO 639. The MULTEXT-East morphosyntactic descriptions (MSDs) follow the revised common tables of lexical specifications MULTEXT-East/Mondilex. The lexical MSDs have been converted to an fslib, a feature-structure library, while their decomposition into features is given in an flib, a feature library. The words in the texts have their MSD encoded as the value of the ana (#IDREF) attribute. This attribute refers to an fs, which, in turn, refers via its feats (#IDREFS) attribute to the f elements that define it.
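
The markup and pointer chain described above (w with lemma and ana, ana pointing to an fs, the fs pointing via feats to f elements) can be illustrated with a minimal Python sketch. Everything in the sample is an assumption made for illustration: the ids, the MSD "Ncns", the feature names and values, the container element names, and the absence of a TEI namespace are all invented here and are not taken from the corpus files.

```python
import xml.etree.ElementTree as ET

# Expanded name of the xml:id attribute as ElementTree reports it.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical miniature document: ids, MSD, feature names and container
# element names are invented; real files would also carry the TEI namespace.
SAMPLE = """
<TEI>
  <fLib>
    <f xml:id="N1.n" name="CATEGORY"><symbol value="Noun"/></f>
    <f xml:id="N2.c" name="Type"><symbol value="common"/></f>
    <f xml:id="N4.s" name="Number"><symbol value="singular"/></f>
  </fLib>
  <fsLib>
    <fs xml:id="Ncns" feats="#N1.n #N2.c #N4.s"/>
  </fsLib>
  <text>
    <div xml:id="Oen.1" type="part">
      <p xml:id="Oen.1.1">
        <s xml:id="Oen.1.1.1">
          <w lemma="clock" ana="#Ncns">clock</w>
          <c>.</c>
        </s>
      </p>
    </div>
  </text>
</TEI>
"""


def build_index(root, tag):
    """Map xml:id -> element for every element with the given tag."""
    return {el.get(XML_ID): el for el in root.iter(tag) if el.get(XML_ID)}


def resolve_msd(word, fs_index, f_index):
    """Follow ana -> fs -> feats -> f and return a feature dictionary."""
    fs = fs_index[word.get("ana").lstrip("#")]
    features = {}
    for ref in fs.get("feats", "").split():
        f = f_index[ref.lstrip("#")]
        symbol = f.find("symbol")
        features[f.get("name")] = symbol.get("value") if symbol is not None else None
    return features


root = ET.fromstring(SAMPLE)
fs_index = build_index(root, "fs")
f_index = build_index(root, "f")

# Collect (form, lemma, features) per token; punctuation has no lemma or MSD.
tokens = []
for sentence in root.iter("s"):
    for tok in sentence:
        if tok.tag == "w":
            tokens.append((tok.text, tok.get("lemma"), resolve_msd(tok, fs_index, f_index)))
        elif tok.tag == "c":
            tokens.append((tok.text, None, {}))

for form, lemma, feats in tokens:
    print(form, lemma, feats)
```

The sketch indexes fs and f elements by xml:id rather than by their position in a library, so it does not depend on the container names assumed in the sample.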

^§text-profile description

^§language usage

^§language ident = bg	Bulgarian
^§language ident = cs	Czech
^§language ident = en	English
^§language ident = et	Estonian
^§language ident = hr	Croatian
^§language ident = hu	Hungarian
^§language ident = mk	Macedonian
^§language ident = pl	Polish
^§language ident = ro	Romanian
^§language ident = ru	Russian
^§language ident = sh	Serbo-Croatian
^§language ident = sk	Slovak
^§language ident = sl	Slovene
^§language ident = sl-rozaj	Resian (dialect of Slovene)
^§language ident = sr	Serbian
^§language ident = uk	Ukrainian

^§revision description

^§change	2010-05-09_<date>Tomaž Erjavec_<name>Conversion to MULTEXT-East V4 / TEI P5.
^§change	2004-05-10_<date>Tomaž Erjavec_<name>From BETA to FINAL V3
^§change	2004-04-09_<date>Tomaž Erjavec_<name>Added Serbian
^§change	2004-02-27_<date>Tomaž Erjavec_<name>Harmonised with TELRI/cesDoc corpus.
^§change	2003-02-11_<date>Tomaž Erjavec_<name>Conversion to TEI P4 XML
^§change	2001-03-19_<date>Tomaž Erjavec_<name>Modifications to teiHeaders; new MSD library
^§change	2000-10-30_<date>Tomaž Erjavec_<name>Conversion to TEI, initial teiHeader