This document is an HTML 3.2 rendering of a
Corpus Encoding Specification
DTD document, produced in the scope of the
MULTEXT-East
project, by
Fred.
Note that this HTML translation does not contain all the information from the cesHeader.
CES header
Creator: VP
Created: 1996-04-20
Updated: 1997-09-25
File Description
- Title Statement
- Title:
- Multext-East CES1: Nineteen Eighty-Four, Czech
- Responsibility
-
Vladimír Petkevič
(
Checked and modified markup for correctness down
to the subparagraph level
)
Greg Priest-Dorman
(
Added tagging of sentences in paragraphs using MtSgml and
Czech resources.
)
- Edition:
- MTE Final Release
- Extent:
- 80366 words
1230804 bytes
- Publication Statement
- Distributor:
-
Institute of Theoretical and Computational Linguistics,
Faculty of Philosophy, Charles University, Czech Republic
(UTKL FFUK)
- Address:
-
Celetná 13,
Prague, Czech Republic
- Electronic address:
-
vladimir.petkevic@ff.cuni.cz
- Electronic address:
-
ucnk.ff.cuni.cz directory: pub/corpora/ME
- Availability:
-
Available for research purposes upon receipt of signed agreement
- Publication date:
- October 1, 1997
- Source Description
- Full Bibliography
- Title Statement
- Title:
-
Electronic form of 1984 by George Orwell
in Czech, obtained via OCR
- Responsibility
-
Vladimír Petkevič
Institute of Theoretical and Computational Linguistics,
Faculty of Philosophy, Charles University, Czech Republic
(UTKL FFUK)
(
OCR'ed the novel
)
- Publication Statement
- Distributor:
-
Institute of Theoretical and Computational Linguistics,
Faculty of Philosophy, Charles University, Czech Republic
- Address:
-
Celetná 13, Praha 1
Czech Republic
- Availability:
-
Available for research purposes only
- Publication date:
-
May 1, 1996
- Source Description
- Structured Bibliography
- Monograph
- Title:
-
1984
- Author:
-
George Orwell
- Imprint
- Publication date:
-
1991
- Publisher:
-
Naše vojsko
- Place:
-
Prague, Czech Republic
Encoding Description
- Project Description:
-
MULTEXT-East:
Multilingual Text Tools and Corpora for Central and
Eastern European Languages.
EU Copernicus Project COP106
- Tag declaration:
- abbr = 23
- body = 1
- date = 39
- div = 28
- foreign = 91
- head = 1
- hi = 75
- item = 4
- l = 33
- list = 1
- mentioned = 244
- name = 2181
- note = 2
- num = 48
- p = 1285
- poem = 11
- ptr = 1
- q = 2208
Q tags with an attribute of "type=MI" have been inserted
automatically after S insertion.
- quote = 36
- s = 6714
S tags have been inserted automatically and then cleaned up by hand in
the locations (character offsets) provided by MTSeg version 1.3.1
using the Czech resource files.
- term = 2
- text = 1
- title = 45
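The tag declaration above is a per-element count of opening tags. A minimal sketch of how such counts could be recomputed from a marked-up text is given below; it is not the tool the project used, and the function name `count_tags` and the sample markup are assumptions made for illustration only.

```python
import re
from collections import Counter

# Matches an opening tag such as <s>, <q type="MI"> or <DIV id=x>.
# Closing tags (</s>) and declarations (<!DOCTYPE ...>) are not matched
# because the character after "<" must be a letter.
OPEN_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.-]*)(?:\s[^<>]*)?/?>")


def count_tags(markup: str) -> Counter:
    """Count opening tags by element name, folded to lower case."""
    return Counter(m.group(1).lower() for m in OPEN_TAG.finditer(markup))


if __name__ == "__main__":
    # Hypothetical fragment, not taken from the corpus.
    sample = '<text><body><p><s>Ab.</s> <s><q type="MI">Cd.</q></s></p></body></text>'
    for name, n in sorted(count_tags(sample).items()):
        print(f"- {name} = {n}")
```

Element names are folded to lower case so that SGML markup written as `<S>` or `<s>` is counted under one name, matching the lower-case names in the list above.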
Revision Description
- Date: 1996-05-03
Vladimír Petkevič, UTKL FFUK
-
1) Corrected the header, so it better corresponds to
CES recommendations
2) Fixed n, id values in DIVs
3) mdash entity is now used only for sentential
punctuation
- Date: 1996-10-21
Vladimír Petkevič, UTKL FFUK
-
1) Marked up down to the subparagraph level according to
the CES canonical markup of the English version
2) Corrected the header so as to meet the
requirements imposed by creating the corpus
containing all corpus components as one SGML
document
- Date: 1997-02-24
Vladimír Petkevič
- Changed IDs, PREV and NEXT attributes previously
using "1984cs" to "Ocs"
-
Converted words and sentences in capital letters to
lowercase
- Corrected broken quotes
- Erased some redundant rendition information
- Corrected and updated the corpus according to the
changes specified in mte1984-en.ces.V1.1.CHANGES
- Improved text readability
- fixed some typos in the text
- updated BYTECOUNT and WORDCOUNT
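The change of the "1984cs" prefix to "Ocs" in ID, PREV and NEXT attributes can be sketched as a pattern substitution restricted to those attributes. This is only an illustration: the function name `rename_prefix` and the attribute value shape in the example (`1984cs.1.1.1`) are assumptions, since the header does not show the actual identifiers.

```python
import re

# Captures the attribute name, "=" and an optional opening quote, followed
# by the old prefix. Only id, prev and next attributes are touched.
OLD_PREFIX = re.compile(r"\b((?:id|prev|next)\s*=\s*[\"']?)1984cs", re.IGNORECASE)


def rename_prefix(markup: str) -> str:
    """Replace the old '1984cs' prefix with 'Ocs' in id/prev/next values."""
    return OLD_PREFIX.sub(r"\g<1>Ocs", markup)


if __name__ == "__main__":
    # Hypothetical element; the real identifier layout is not given here.
    print(rename_prefix('<s id="1984cs.1.1.1" next="1984cs.1.1.2">text</s>'))
```

Restricting the match to attribute positions leaves any occurrence of the string "1984cs" in running text unchanged.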
- Date: 1997-03-20
Tomaž Erjavec, IJS
- Normalisation of corpus component CESHEADER elements:
CESHEADER, EDITIONSTMT, TITLESTMT/H.TITLE
- ISO LANGUAGEs implemented as marked section PUBLIC entities
- Language (WSDs) implemented as PUBLIC entities
- Newspeak LANGUSAGE/LANGUAGE IDs now ns-xx for lang xx
- Now every QUOTE in 1984 has at least one P
- Date: 1997-04-02
Greg Priest-Dorman
- inserted S tags in the locations given by MtSeg
-
inserted Q and HI tags where necessary as a result of
S tag insertion
- updated TAGUSAGE
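Inserting S tags at segmenter-supplied character offsets can be sketched as follows. The output format of MtSeg is not described in this header, so the `(start, end)` span list used here is an assumed, hypothetical input shape, and `insert_s_tags` is an illustrative name rather than part of any MULTEXT tool.

```python
from typing import Iterable, Tuple


def insert_s_tags(text: str, spans: Iterable[Tuple[int, int]]) -> str:
    """Wrap each (start, end) character span of text in <s>...</s>.

    Spans are half-open [start, end) and must not overlap.
    """
    out = []
    pos = 0
    for start, end in sorted(spans):
        if start < pos or end < start or end > len(text):
            raise ValueError(f"bad or overlapping span: {(start, end)}")
        out.append(text[pos:start])                    # text between sentences
        out.append("<s>" + text[start:end] + "</s>")  # the sentence itself
        pos = end
    out.append(text[pos:])                             # trailing text, if any
    return "".join(out)


if __name__ == "__main__":
    print(insert_s_tags("Ab. Cd.", [(0, 3), (4, 7)]))
```

A sketch like this covers only the mechanical step; the manual clean-up and the follow-up Q and HI insertion recorded in the revisions above are not modelled.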
- Date: 1997-05-12
Vladimír Petkevič
- corrected some minor errors caused by wrong MtSeg segmentation
- corrected some typos as revealed by segmentation
- adjusted some paragraphs to conform with the English canonical
version for the sake of sentence alignment
- adjusted tagusage, wordcount and bytecount info
- Date: 1997-06-18
Vladimír Petkevič
- corrected 2 typos
- Date: 1997-07-24
Vladimír Petkevič
- added 2 words
- Date: 1997-09-25
Tomaž Erjavec
- Changed editionStmt, Extent, pubDate, Availability
to final form