This document is an HTML 3.2 rendering of a
Corpus Encoding Specification
DTD document, produced in the scope of the
MULTEXT-East
project, by
Fred.
Note that this HTML translation does not contain all the information from the cesHeader.
CES header
Creator: CO
Created: 1996-04-22
Updated: 1997-09-25
File Description
- Title Statement
- Title:
- Multext-East CES1: Nineteen Eighty-Four, Hungarian
- Responsibility
- Csaba Oravecz
(CES1 conformant tagging)
Greg Priest-Dorman
(added tagging of sentences in paragraphs using MtSgml and
Hungarian resources)
- Edition:
- MTE Final Release
- Extent:
- 81167 words
1270210 bytes
- Publication Statement
- Distributor:
- Research Institute for Linguistics, Hungarian Academy of Sciences
- Address:
- Budapest, Színház u. 5-9.
- Availability:
- Available for research purposes upon receipt of signed agreement
- Publication date:
- October 1, 1997
- Source Description
- Structured Bibliography
- Monograph
- Title:
- 1984
- Author:
- George Orwell
- Imprint
- Publication date:
- 1989
- Publisher:
- Európa Könyvkiadó
- Place:
- Budapest
Encoding Description
- Project Description:
- MULTEXT-East:
Multilingual Text Tools and Corpora for Central and
Eastern European Languages.
EU Copernicus Project COP106
- Tag declaration:
- abbr = 38
- body = 1
- date = 39
All dates which contain one or more digits are
marked, including dates specifying day/month/year and dates consisting
only of a year.
- div = 28
- foreign = 43
The Newspeak words "gondolatbűn" (thoughtcrime), "gondolatbűnöző"
(thoughtcriminal) and "duplagondol" (doublethink) are consistently
marked as FOREIGN when they do not appear inside some other tag whose
lang attribute provides the language information. Latin and French
words are also marked.
- head = 5
- hi = 71
The highlighting tag is used to mark words and phrases which were
typographically distinguished in the printed version of the text, and
for which no other more precise tag is applicable.
- item = 4
- l = 32
- list = 1
- mentioned = 218
- name = 1843
Frequently occurring names of people, places, organizations,
products, languages, and events are marked.
- note = 2
- num = 10
- p = 1292
- poem = 10
- ptr = 2
- q = 2197
The Q tag is used to mark quoted dialogue. The attribute
"type=indirect" is used when attributed speech is marked
typographically in the printed text. The attribute "type=written" is
used in those cases where Winston's writing in his diary is
represented as quoted thought. If no "rend" attribute is provided on
the Q tag, the value is assumed to be "PRE mdash POST mdash". The
exception is the second section of a broken Q tag (see below): there,
the absence of a "rend" attribute indicates that the text has no
typographical marking, and any typographical marking that is present
is given explicitly in the "rend" attribute.
The attribute "broken=yes" is used when no sentence-terminating
punctuation (either inside the Q itself or in the
intervening text between two Qs) appears between two dialogue
fragments by the same speaker. Q tags with an attribute of "type=MI"
have been inserted automatically after S insertion.
- quote = 35
QUOTE marks quotations from outside sources, including extensive
quotations from Winston's diary and Goldstein's treatise.
- s = 6732
S tags have been inserted automatically and then cleaned up by hand in
the locations (character offsets) provided by MTSeg version 1.3.1
using the Hungarian resource files.
- text = 1
- title = 40
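The default-rendition rule described for the Q tag can be illustrated with a minimal Python sketch. It is not part of the MULTEXT-East tools. The function name effective_rend, the lowercase attribute names, and the use of a "prev" attribute to recognise the second section of a broken Q are assumptions made for illustration only.

```python
# Default rendition assumed when a Q tag carries no "rend" attribute.
DEFAULT_REND = "PRE mdash POST mdash"


def effective_rend(attrs):
    """Return the rendition that applies to a Q element.

    `attrs` is a dict of the element's attributes (hypothetical
    lowercase names). Returns None when the text is understood to
    have no typographical marking.
    """
    rend = attrs.get("rend")
    if rend is not None:
        # An explicit rendition always wins.
        return rend
    # Assumption: the second section of a broken Q is the one that
    # points back to its first section through a "prev" attribute.
    if attrs.get("broken") == "yes" and "prev" in attrs:
        return None
    return DEFAULT_REND
```

Under these assumptions, a Q tag with no attributes resolves to the default dash rendition, while the continuation of a broken Q without "rend" resolves to no marking at all.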
Revision Description
- Date: 1996-10-25
Csaba Oravecz
- modified header according to ME template
- Date: 1996-10-31
Csaba Oravecz
- revised paragraph level marking in accordance with the English version
- Date: 1997-02-25
Csaba Oravecz
- revised paragraph level marking in accordance with the English version
- marked broken Q tags and linked them with PREV and NEXT attributes
- updated BYTECOUNT
- Date: 1997-03-20
Tomaž Erjavec, IJS
- Normalisation of corpus component CESHEADER elements:
CESHEADER, EDITIONSTMT, TITLESTMT/H.TITLE
- ISO LANGUAGEs implemented as marked section PUBLIC entities
- Language (WSDs) implemented as PUBLIC entities
- Newspeak LANGUSAGE/LANGUAGE IDs now ns-xx for lang xx
- Now every QUOTE in 1984 has at least one P
- Date: 1997-04-02
Greg Priest-Dorman
- inserted S tags in the locations given by MtSeg
- inserted Q and HI tags where necessary as a result of S
tag insertion
- updated TAGUSAGE
- Date: 1997-04-10
Csaba Oravecz
- inserted dummy DIV element in 2nd part where the English has the
10th chapter, to conform to the English version for the
sake of alignment
- updated TAGUSAGE and BYTECOUNT
- updated IDs
- Date: 1997-05-08
Csaba Oravecz
- corrected two mistakes in encoding: inserted one P tag and deleted one P tag
- Date: 1997-05-15
Csaba Oravecz
- corrected sentence segmentation errors
- updated TAGUSAGE and BYTECOUNT
- updated IDs
- Date: 1997-06-08
Csaba Oravecz
- corrected sentence segmentation errors revealed by sentence level alignment
- corrected order of sentences to conform to English version in part 2, ch. 1, paragraph 37., listed in correction
- updated TAGUSAGE and BYTECOUNT
- updated IDs
- Date: 1997-09-11
Csaba Oravecz
- corrected one sentence segmentation error revealed by sentence level alignment
- updated TAGUSAGE and BYTECOUNT
- updated IDs
- Date: 1997-09-25
Csaba Oravecz
- minor typographical corrections
- updated BYTECOUNT
- Date: 1997-09-25
Tomaž Erjavec
- Changed editionStmt, byteCount, Availability to final form