This document is an HTML 3.2 rendering of a Corpus Encoding Specification DTD document,
by Fred, using the ceshdr2html_tmap.fred translation map.
Note that this HTML translation does not contain all the information
from the original document.
Uses ISO 8859-1 (Latin-1) encoding.
CES header
Version: 4.1, Type: text, Language: en,
Creator: ET, Status: update, Created: 1997-12-15, Updated: 1997-12-20
File Description
- Title Statement
- Title:
- Multext-East cesAna: Nineteen Eighty-Four, English
- Responsibility
- Nancy Ide
(Overall Responsibility)
Greg Priest-Dorman
(Generation of Lexical Data)
Vladimír Petkevic
(Conversion to cesAna DTD)
- Edition:
- MTE Final Release
- Extent:
- 103997 words, approx. 38 MB
Note: wordCount represents the number of TOK TYPE=WORD
elements in the text.
- Publication Statement
- Distributor:
-
Department of Computer Science, Vassar College
- Address:
- Poughkeepsie, New York 12604-0252 USA
- Electronic address:
- email: ide@cs.vassar.edu
- Availability:
-
Available for research purposes upon receipt of signed agreement
- Publication date:
- January 1st, 1998
- Source Description
- Full Bibliography
- Title Statement
- Title:
- Multext-East CES1: Nineteen Eighty-Four, English
- Publication Statement
- Distributor:
-
Department of Computer Science, Vassar College
- Address:
- Poughkeepsie, New York 12604-0252 USA
- Electronic address:
- email: ide@cs.vassar.edu
- Availability:
-
Available for research purposes upon receipt of signed agreement
- Publication date:
- October 1, 1997
- Source Description
- Full Bibliography
- Title Statement
- Title:
-
The European Corpus Initiative
Multilingual Corpus 1:
1984 by George Orwell (English)
- Responsibility
- Association for Computational Linguistics
(Converted from OTA's DTD to ECI DTD)
- Publication Statement
- Distributor:
- ACL
- Address:
- ACL
- Availability:
-
Available for research purposes upon receipt of signed
agreement
- Publication date:
- 1994
- Source Description
- Full Bibliography
- Title Statement
- Title:
- Orwell's 1984: electronic edition
- Responsibility
- Oxford Text Archive
(
The four versions of Orwell's 1984 in the OTA
were all prepared by the OUCS KDEM service in
1985 for Dr David C Bennett of the School of
Oriental and African Studies at London
University. The texts here have not been
encoded or proofread in any way since they were
produced (other than the English text, which was
converted to an SGML-like encoding by John
Price-Wilkin, and subsequently automatically
converted to conform to the OTA's DTD by myself
and Alan Morrison). The other languages were
converted to TEI-conformant SGML by the ECI
project in 1993. --LB, Nov 1992
)
- Edition:
-
Public Domain TEI edition prepared at the Oxford Text
Archive
- Publication Statement
- Distributor:
- Oxford Text Archive
- Address:
-
Oxford University Computing Service
13 Banbury Road
Oxford OX2 6NN UK
archive@ox.ac.uk
- Availability:
-
Freely available for non-commercial
use provided that this header is included in its
entirety with any copy distributed
- Publication date:
- 19 Nov 1992
- Source Description
- Structured Bibliography
- Monograph
- Title:
- 1984
- Author:
- George Orwell
- Imprint
- Publication date:
- 1949; reprinted 1961
- Publisher:
- New American Library
- Place:
- New York
Encoding Description
- Project Description:
-
MULTEXT-East:
Multilingual Text Tools and Corpora for Central and Eastern
European Languages.
EU Copernicus Project COP106
- Editorial declaration:
- Transduction:
-
In the cesDoc to cesAna conversion, DIV, QUOTE, Q tags and
HEAD, POEM, LIST elements have been omitted. cesDoc P
elements are encoded as PAR, and S as S.
cesDoc sub-S level tags are omitted: DATE, NAME, ABBR, etc.
- Quotation:
-
Q and QUOTE tags from the cesDoc source are not retained.
- Segmentation:
-
S segmentation same as in cesDoc source (hand-validated).
TOK segmentation performed with mtseg and manually corrected.
- Tag declaration:
- chunklist = 1
-
Element corresponds to TEXT of the cesDoc source
- chunk = 1
-
Element corresponds to BODY of the cesDoc source
- par = 1286
-
Elements correspond to P elements of the cesDoc source.
The FROM attribute gives the reference to the ID of the
corresponding cesDoc P element.
- s = 6701
-
Elements correspond to S elements of the cesDoc source.
The FROM attribute gives the reference to the ID of the
corresponding cesDoc S element.
- tok = 118102
-
Tokens are of TYPE=WORD or PUNCT, with the CLASS attribute
giving the mtseg class of the token (ABBR, COMP, INIT, TTL).
The FROM attribute gives the reference to the ID of the
corresponding cesDoc S element in which the token in
question appears along with the character offset of the
token within the sentence (the character offset is appended
to the sentence ID).
- orth = 118102
-
Contains the orthography of the token, as found in the cesDoc source
(except for COMP tokens, which have an underscore in place of each blank).
- disamb = 187526
-
Contains disambiguated lexical information for
WORDs. Disambiguation performed by Eric Brill's Unsupervised
Part-of-Speech Tagger, Version 0.8, trained on chapters 1 and 2
of Multext-East CES1: Nineteen Eighty-Four, English. A token
with several DISAMBs indicates that the Brill Tagger was
not able to fully disambiguate the token. In such a case
all equally-weighted possibilities are listed.
- lex = 214404
-
Contains undisambiguated lexical information for WORDs.
- base = 401930
-
Base or lemma of a WORD. In the event that the base of
the WORD was not known, the content of this tag will be
"??" (two question marks).
- msd = 401930
-
Morphosyntactic description of a WORD. In the event that
the MSD of the WORD was not known, the content of this tag
will be "??" (two question marks).
- ctag = 416035
-
Corpus tag (for tok type=WORD and for tok type=PUNCT).
In the event that the CTAG of the WORD was not known, the
content of this tag will be "??" (two question marks).
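
The tag counts above are mutually consistent if each DISAMB and each LEX
carries one BASE, one MSD and one CTAG, and each PUNCT token carries a CTAG
directly: 187526 + 214404 = 401930, and 401930 + (118102 - 103997) = 416035.
That nesting is inferred from the counts, not stated in this header. The
Python sketch below tallies the same elements on a small sample and flags
tokens with several DISAMBs or with an unknown ("??") base. The sample is
hypothetical: it is written as well-formed XML rather than the original
SGML, it uses lowercase element names, it omits the FROM attributes, and
its words, MSDs and corpus tags are invented for illustration.

```python
import xml.etree.ElementTree as ET

UNKNOWN = "??"  # marker the header gives for an unknown base, MSD or CTAG

# Hypothetical sample; element nesting is an assumption inferred from the counts.
SAMPLE = """
<chunk>
  <par>
    <s>
      <tok type="WORD">
        <orth>clocks</orth>
        <disamb><base>clock</base><msd>Ncnp</msd><ctag>NNS</ctag></disamb>
        <lex><base>clock</base><msd>Ncnp</msd><ctag>NNS</ctag></lex>
        <lex><base>clock</base><msd>Vmip3s</msd><ctag>VBZ</ctag></lex>
      </tok>
      <tok type="WORD">
        <orth>striking</orth>
        <disamb><base>strike</base><msd>Vmpp</msd><ctag>VBG</ctag></disamb>
        <disamb><base>striking</base><msd>Afp</msd><ctag>JJ</ctag></disamb>
        <lex><base>strike</base><msd>Vmpp</msd><ctag>VBG</ctag></lex>
        <lex><base>striking</base><msd>Afp</msd><ctag>JJ</ctag></lex>
      </tok>
      <tok type="WORD" class="COMP">
        <orth>Airstrip_One</orth>
        <disamb><base>??</base><msd>??</msd><ctag>??</ctag></disamb>
        <lex><base>??</base><msd>??</msd><ctag>??</ctag></lex>
      </tok>
      <tok type="PUNCT">
        <orth>.</orth>
        <ctag>PERIOD</ctag>
      </tok>
    </s>
  </par>
</chunk>
"""


def summarize(xml_text):
    """Count the declared elements and flag ambiguous or unknown WORD tokens."""
    root = ET.fromstring(xml_text.strip())
    names = ("tok", "orth", "disamb", "lex", "base", "msd", "ctag")
    counts = {name: sum(1 for _ in root.iter(name)) for name in names}

    toks = list(root.iter("tok"))
    words = [t for t in toks if t.get("type") == "WORD"]
    counts["word"] = len(words)
    counts["punct"] = sum(1 for t in toks if t.get("type") == "PUNCT")
    # Several DISAMBs on one token: the tagger could not fully disambiguate it.
    counts["ambiguous"] = sum(1 for t in words if len(t.findall("disamb")) > 1)
    # WORD tokens with at least one DISAMB whose base is the "??" marker.
    counts["unknown"] = sum(
        1
        for t in words
        if any(d.findtext("base") == UNKNOWN for d in t.findall("disamb"))
    )
    return counts


if __name__ == "__main__":
    for name, value in summarize(SAMPLE).items():
        print(f"{name:10s} {value}")
```

On the sample, the same relations hold as in the header: the base count
equals disamb plus lex, and the ctag count equals the base count plus the
number of PUNCT tokens.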
Revision Description
- Date: 1997-12-16 (Vladimír Petkevic)
- Revised several tagusage descriptions, and supplied
counts in the header.
- Date: 1997-12-20 (Tomaz Erjavec, IJS)
-
Meta-Made by et