This document is an HTML 3.2 rendering of a Corpus Encoding Specification DTD document, produced in the scope of the MULTEXT-East project by Fred.
Note that this HTML translation does not contain all the information from the cesHeader.
CES header
Creator: NMI
Created: 1995-05-10
Updated: 1997-09-25
File Description
- Title Statement
- Title:
-
Nineteen Eighty-Four (English);
Multext-East CES1 version
- Responsibility
- Nancy Ide
(
Modified ECI tags of first chapter to conform to CES
Added or modified some sub-paragraph level tagging.
)
- Responsibility
- Tomaz Erjavec
(Modified full ECI Orwell to conform to CES V3.15)
- Responsibility
- Greg Priest-Dorman
(
Modified Tomaz Erjavec's full Orwell to conform to CES V3.21
Checked and modified markup for correctness down to the paragraph level
)
- Responsibility
- Greg Priest-Dorman
(
Added tagging of sentences in paragraphs using MtSeg and English resources.
)
- Edition:
- MTE Final Release
- Extent:
- 104302 words
928986 bytes
Note:
WordCount represents the number of words in this text exclusive
of tags and header information. ByteCount reflects the approximate
size of the file containing the doctype and cesDoc element including
all text, tags and header information.
- Publication Statement
- Distributor:
- Laboratoire Parole et Langage
- Address:
- 29, avenue Robert Schuman, Aix-en-Provence, France
- Availability:
-
Available for research purposes upon receipt of signed agreement
- Publication date:
- October 1, 1997
- Source Description
- Full Bibliography
- Title Statement
- Title:
-
The European Corpus Initiative
Multilingual Corpus 1:
1984 by George Orwell (English)
- Responsibility
- Association for Computational Linguistics
(Converted from OTA's DTD to ECI DTD)
- Publication Statement
- Distributor:
- ACL
- Address:
- ACL
- Availability:
-
Available for research purposes upon receipt of signed agreement
- Publication date:
- 1994
- Source Description
- Full Bibliography
- Title Statement
- Title:
- Orwell's 1984: electronic edition
- Responsibility
- Oxford Text Archive
(
The four versions of Orwell's 1984 in the OTA were all prepared by the
OUCS KDEM service in 1985 for Dr David C Bennett of the School of
Oriental And African Studies at London University. The texts here
have not been encoded or proofread in any way since they were produced
(other than the English text, which was converted to an SGML-like
encoding by John Price-Wilkin, and subsequently automatically
converted to conform to the OTA's dtd by myself and Alan Morrison. The
other languages were converted to TEI conformant SGML by the ECI
project 1993.) ----LB, Nov 1992
)
- Edition:
-
Public Domain TEI edition prepared at the Oxford Text Archive
- Publication Statement
- Distributor:
- Oxford Text Archive
- Address:
-
Oxford University Computing Service
13 Banbury Road
Oxford OX2 6NN UK
archive@ox.ac.uk
- Availability:
-
Freely available for non-commercial use provided that this header is
included in its entirety with any copy distributed
- Publication date:
- 19 Nov 1992
- Source Description
- Structured Bibliography
- Monograph
- Title:
- Nineteen Eighty Four
- Imprint
- Publication date:
- 1949; reprinted 1961
- Publisher:
- New American Library
- Place:
- New York
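The Extent note in the File Description above defines WordCount as the number of words exclusive of tags and header information. A minimal Python sketch of that kind of count follows. The tokenization rule (runs of letters and digits, optionally joined by an apostrophe or hyphen) and the function name `word_count` are assumptions made for illustration; the header does not say how the published figure of 104302 words was computed, and the caller is assumed to pass only the text portion, not the header.

```python
import re

TAG = re.compile(r"<[^>]*>")                      # start tags, end tags, declarations
ENTITY = re.compile(r"&[A-Za-z][A-Za-z0-9.]*;")   # entity references such as &mdash;
# Assumed definition of a "word": letters/digits, optionally joined by ' or -
WORD = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def word_count(sgml_text):
    """Count words in marked-up text, ignoring tags and entity references."""
    text = TAG.sub("", sgml_text)     # remove tags in place so "<name>X</name>'s" stays one word
    text = ENTITY.sub(" ", text)      # an entity such as &mdash; separates words
    return len(WORD.findall(text))

sample = "<p><s>It was a bright cold day in <name>April</name>.</s></p>"
print(word_count(sample))
```

Removing tags without inserting a space keeps a genitive such as `<name>Winston</name>'s` as a single word, at the cost of merging two words that are separated only by a tag.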
Encoding Description
- Project Description:
-
This English version of Orwell's 1984 is encoded conformant to level 1
specifications of the Corpus Encoding Standard for the MULTEXT-EAST
project. The English is to serve as the base for the parallel corpus,
which will include aligned versions of the text in Romanian,
Bulgarian, Estonian, Slovenian, Czech, and Hungarian.
- Tag declaration:
- abbr = 38
Abbreviations are marked only within marked names. Other abbreviations
are not marked.
- body = 1
- date = 40
All dates which contain one or more digits (the characters 0-9) are
marked, including dates specifying day/month/year and dates consisting
only of a year. No attempt was made to identify or mark dates in other forms.
- distinct = 1
- div = 28
- foreign = 39
The Newspeak words "thoughtcrime" and "doublethink" are consistently
marked as FOREIGN, when they do not appear in some other tag where the
lang attribute provides the language information. Latin and French
words are also marked.
- head = 1
- hi = 103
The highlighting tag is used to mark words and phrases which were
typographically distinguished in the printed version of the text, and
for which no other more precise tag is applicable. In most of these
cases, such highlighting signifies emphasis.
- item = 4
- l = 32
- list = 1
- mentioned = 261
Rendition information has not been
systematically retained. When no rendition information is provided,
rendering is generally in italics in the 1949 Harcourt, Brace and World
Edition of Nineteen Eighty-Four. The original
electronic version contained rendition information inconsistent with
the 1949 Harcourt edition.
- name = 1744
Frequently occurring names of people, places, organizations,
products, languages, and events, are marked. If a name is marked, every
occurrence of that name is marked.
For person names in the genitive, the English genitive suffix "'s" is left
outside the markup. For other names, only those occurrences which function as
stand-alone proper nouns are marked; adjectival uses (e.g., "Newspeak
words") are not marked.
- note = 2
- num = 52
Anything containing one or more digits (the characters 0-9) that is
not part of a date, and all roman numerals, are marked as a
number. In cases where a ratio is expressed (per cent, per thousand),
the entire phrase (e.g., "10 per cent") is marked as a number.
- p = 1286
- poem = 10
- ptr = 2
- q = 2209
The Q tag is used to mark quoted dialogue. The attribute
"type=indirect" is used when attributed speech is marked
typographically in the printed text (e.g., "I know you," he seemed to
say). The attribute "type=written" is used in those cases where
Winston's writing in his diary is represented as quoted thought (e.g.,
"If there is hope," he wrote, "it lies in the Proles."). If no "rend"
attribute is provided on the Q tag, the value is assumed to be "PRE
ldquo" on the first Q in a series of Qs within the same P unbroken by
#PCDATA and "POST rdquo" on the last Q in the series. The attribute
"broken=yes" is used when no sentence terminating punctuation (either
inside the Q itself or in the intervening text between two Qs) appears
between two dialogue fragments by the same speaker.
- quote = 35
QUOTE marks quotations from outside sources, including extensive
quotations from Winston's diary and Goldstein's treatise.
- s = 6701
S tags have been inserted automatically at the locations (character
offsets) provided by MtSeg version 1.3.1 using the English resource
files, and then cleaned up by hand.
- text = 1
- title = 46
Rendition information has not been
systematically retained. The original
electronic version contained rendition information inconsistent with
the 1949 Harcourt, Brace and World edition.
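The "tag = N" figures in the tag declaration above are per-element usage counts. As a rough illustration of how such counts could be recomputed after an edit, here is a minimal Python sketch that counts start tags by generic identifier. It assumes fully tagged text (no SGML tag minimization) and attribute values that contain no ">" character; the function name `tag_usage` is hypothetical and not part of any MULTEXT-East tool.

```python
import re
from collections import Counter

# Matches a start tag and captures its generic identifier (GI).
# End tags ("</p>") and declarations ("<!DOCTYPE ...>") do not match,
# because the character after "<" must be a letter.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.-]*)[^>]*>")

def tag_usage(sgml_text):
    """Return a Counter mapping each lowercased GI to its number of start tags."""
    return Counter(m.group(1).lower() for m in START_TAG.finditer(sgml_text))

sample = (
    '<p><s><q type="written">If there is hope,</q> he wrote, '
    '<q type="written">it lies in the <name>Proles</name>.</q></s></p>'
)

# Print the counts in the same "gi = N" shape used in the tag declaration.
for gi, count in sorted(tag_usage(sample).items()):
    print(f"{gi} = {count}")
```

Counting only start tags means each element is counted once, regardless of whether its end tag is present.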
Revision Description
- Date: 9/5/96
Tomaž Erjavec, IJS
- Corrected chapter 1 (esp. header) to CES V2 conformance
-
corrected, with a spelling checker, a number of original OCR typos:
I instead of l, rn instead of m
- inserted Qs
- inserted some missing apostrophes
- changed '. . .' to '...', ' !' to '!', ' ?' to '?'
-
changed a number of GIs, as CES does not support ECI ones:
EMPH to HI
MENTION to MENTIONED and removed punctuation on single words therein
GLOSS to TERM (best I could come up with, without losing distinction)
- Date: 14/5/96
Tomaž Erjavec, IJS
- Deleted apostrophes from chapter 2 and onwards
- Changed some TERM into FOREIGN
- Date: 14/7/96
Greg Priest-Dorman
- Changed dashes to entity mdash (not complete)
- Added additional q tags where appropriate
- Added quote tags
- Changed q tags to quote tags where appropriate
- All quotation marks replaced with markup
- Replaced q tags with mentioned tags where appropriate
- Standardized the markup of poems in the text
-
Marked broken Q tags as such (linking of broken Q tags with
next and prev attributes is not yet done)
- Date: 15/09/96
Greg Priest-Dorman
- linked broken Q tags with "prev" and "next" attributes
-
all occurrences of "..." and ". . ." have been replaced with the
ISO_8879:1986 Publishing entity "hellip"
-
changes of P and QUOTE tags since version .3 logged in file
p.and.quote.changes, available on request
- names tagged with NAME as stated above in TAGUSAGE "gi=name"
-
quoted text tagged as stated above in TAGUSAGE "gi=q" and TAGUSAGE "gi=quote"
-
dates and numbers tagged as stated above in TAGUSAGE "gi=num" and
TAGUSAGE "gi=date"
-
abbreviations are tagged as stated above in TAGUSAGE "gi=abbr"
-
OCR errors have been corrected when found, most noticeably, the "p"
at the beginning of "Party" was usually incorrectly in lower case.
-
"rend" if added has been checked against the 1949 Harcourt,
Brace & World, Inc. edition of Nineteen Eighty-Four
- Date: 15/01/97
Greg Priest-Dorman
- Changed IDs, PREV and NEXT attributes using "1984en" to "Oen"
-
Fixed tagging error in Part 1 Chapter 4 QUOTE 2
(see mte1984-en.ces.V1.1.CHANGES) and reduced TAGUSAGE for P by 2
- fixed some typos in the header
- replaced any tab(^I) characters in the text (there was one)
- reformatted the text for readability and consistency
- updated BYTECOUNT
- Date: 03/03/97
Greg Priest-Dorman
-
Corrected markup: marked broken Qs in Part 1, chapter 8, paragraph 3
(pointed out by O. Csaba).
-
Corrected markup: Part 1 chapter 4, in the list of Newspeak quotes
from the Times, part of the last list item was not in the list; it
is now (pointed out by T. Erjavec)
- Corrected punctuation error: Part 1 chapter 4, on two occasions
the Newspeak quote which ends "fullwise upsub antefiling" occurs. In
the printed edition this is followed by a period, so I added the period.
- Date: 30/04/97
Greg Priest-Dorman
- inserted S tags in the locations given by MtSeg
-
inserted Q and HI tags where necessary as a result of S tag insertion
- Date: 12/05/97
Greg Priest-Dorman
-
Corrected several tagging errors pointed out by T. Erjavec and V. Petkevic
- modified header to comply with T. Erjavec's header style
- updated TAGUSAGE
- removed blank lines
- Date: 14/05/97
Greg Priest-Dorman
- added Ss to two Newspeak paragraphs to aid in alignment
- updated TAGUSAGE
- Date: 19/05/97
Greg Priest-Dorman
-
Corrected several tagging errors pointed out by T. Erjavec
-
Corrected several typos in the text pointed out by T. Erjavec and V. Petkevic
- updated TAGUSAGE
- Date: 20/06/97
Greg Priest-Dorman
-
Corrected several tagging errors pointed out by Vladimir Petkevic
where a sentence boundary was inserted 2 characters ahead of where
it should have been.
- Date: 1997-09-25
Tomaž Erjavec
- Changed editionStmt, byteCount, pubDate, Availability
to final form
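Several entries in the revision log above describe mechanical substitutions: renaming the ECI generic identifiers EMPH, MENTION and GLOSS to HI, MENTIONED and TERM, replacing "..." and ". . ." with the entity hellip, and removing the space before "!" and "?". The Python sketch below shows one way such substitutions could be expressed. It is an illustration only, not the procedure the editors actually used; the function names are hypothetical, and the removal of punctuation inside MENTIONED elements and all hand corrections are not modeled.

```python
import re

# GI renames listed in the 9/5/96 revision entry (ECI name -> CES name).
GI_RENAMES = {"EMPH": "HI", "MENTION": "MENTIONED", "GLOSS": "TERM"}

def rename_gis(text):
    """Rename start and end tags whose GI appears in GI_RENAMES; keep attributes."""
    def swap(match):
        slash, gi, rest = match.groups()
        return f"<{slash}{GI_RENAMES.get(gi.upper(), gi)}{rest}>"
    return re.sub(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)([^>]*)>", swap, text)

def normalize_punctuation(text):
    """Apply the punctuation changes recorded in the revision log."""
    text = re.sub(r"\.\.\.|\. \. \.", "&hellip;", text)  # "..." and ". . ." -> hellip entity
    text = re.sub(r" ([!?])", r"\1", text)               # " !" -> "!", " ?" -> "?"
    return text

sample = "<EMPH>Who</EMPH> is it ? . . . I wonder ..."
print(normalize_punctuation(rename_gis(sample)))
```

Tags are renamed before punctuation is normalized so that the two passes do not interfere; entity references such as `&hellip;` contain no angle brackets and are left alone by the tag pass.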