This document is an HTML 3.2 rendering of a
Corpus Encoding Specification
DTD document, produced in the scope of the
MULTEXT-East
project, by
Fred.
Note that this HTML translation does not contain all the information from the cesHeader.
CES header
Creator: ET
Created: 1996-04-18
Updated: 1997-09-25
File Description
- Title Statement
- Title:
- Multext-East CES1: Nineteen Eighty-Four, Slovene
- Responsibility
- Tomaž Erjavec
(Error correction and CES1 conformance.)
Olga Vuković
(
Up-translation of ECI version to CES1 V2.0 conformance,
using the printed edition as the reference;
proof-reading the text
)
Greg Priest-Dorman
(
Added tagging of sentences in paragraphs using MtSgml and
Slovene resources.
)
- Edition:
- MTE Final Release
- Extent:
- 91619 words
945857 bytes
Note:
WordCount represents the number of words in this text exclusive
of tags and header information. ByteCount reflects the
size of the file containing the doctype and cesDoc element including
all text, tags and header information.
- Publication Statement
- Distributor:
-
Dept. for Intelligent Systems,
Jozef Štefan Institute,
- Address:
- Jamova 39, SI-1000 Ljubljana, Slovenia
- Electronic address:
- tomaz.erjavec@ijs.si
- Electronic address:
- http://nl.ijs.si/ME
- Availability:
-
Available for research purposes upon receipt of signed agreement
- Publication date:
- October 1, 1997
- Source Description
- Full Bibliography
- Title Statement
- Title:
-
The European Corpus Initiative
Multilingual Corpus 1:
1984 by George Orwell (Slovene)
- Responsibility
- Association for Computational Linguistics
(Converted from OTA's DTD to ECI DTD)
- Publication Statement
- Distributor:
- ACL
- Address:
- ACL
- Availability:
-
Available for research purposes upon receipt of signed
agreement
- Publication date:
- 1994
- Source Description
- Full Bibliography
- Title Statement
- Title:
- Orwell's 1984: electronic edition
- Responsibility
- Oxford Text Archive
(
The four versions of Orwell's 1984 in the OTA
were all prepared by the OUCS KDEM service in
1985 for Dr David C Bennett of the School of
Oriental And African Studies at London
University. The texts here have not been
encoded or proofread in any way since they were
produced (other than the English text, which was
converted to an SGML like encoding by John
Price-Wilkin, and subsequently automatically
converted to conform to the OTA's dtd by myself
and Alan Morrison. The other languages were
converted to TEI conformant SGML by the ECI
project 1993.) --LB, Nov 1992
)
- Edition:
-
Public Domain TEI edition prepared at the Oxford Text
Archive
- Publication Statement
- Distributor:
- Oxford Text Archive
- Address:
-
Oxford University Computing Service
13 Banbury Road
Oxford OX2 6NN UK
archive@ox.ac.uk
- Availability:
-
Freely available for non-commercial
use provided that this header is included in its
entirety with any copy distributed
- Publication date:
- 19 Nov 1992
- Source Description
- Structured Bibliography
- Monography
- Title:
- 1984
- Author:
- George Orwell
- Author:
- Translator: Alenka Puhar
- Imprint
- Publication date:
- 1983
- Publisher:
- Knjižnica Kondor
- Publisher:
- Mladinska knjiga
- Place:
- Ljubljana
Encoding Description
- Project Description:
-
MULTEXT-East:
Multilingual Text Tools and Corpora for Central and Eastern
European Languages.
EU Copernicus Project COP106
- Tag declaration:
- abbr = 26
- body = 1
- date = 33
- div = 28
- foreign = 7
- head = 29
- hi = 242
- item = 4
- l = 34
- list = 1
- name = 1327
- note = 1
- p = 1288
- poem = 10
- ptr = 1
- q = 2260
Q tags with an attribute of "type=MI" have
been inserted automatically after S insertion.
- quote = 35
- s = 6689
S tags have been inserted automatically and then cleaned up
by hand in the locations (character offsets) provided by MtSeg
version 1.3.1 using the Slovene resource files.
- text = 1
- title = 10
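The tag counts above could be recomputed from the marked-up text with a short script. The following is a minimal sketch only: the function name `tag_usage` and the sample string are hypothetical, and the regular expression assumes simple SGML start tags without `>` inside attribute values. It is not the tool that produced the figures in this header.

```python
import re
from collections import Counter

# Matches a start tag such as <p>, <s id="x"> or <Q TYPE="MI">.
# End tags (</p>) and declarations (<!DOCTYPE ...>) are not matched,
# because the character after "<" must be a letter.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.]*)\b[^>]*>")

def tag_usage(markup: str) -> Counter:
    """Count start tags by element name, ignoring case."""
    return Counter(name.lower() for name in START_TAG.findall(markup))

# Hypothetical sample, not taken from the corpus.
sample = (
    "<text><body><div><p><s>Prvi stavek.</s> "
    "<s><q>Drugi stavek.</q></s></p></div></body></text>"
)

counts = tag_usage(sample)
for name in sorted(counts):
    print(f"{name} = {counts[name]}")
```

The output uses the same "name = count" layout as the tag declaration, so a recomputed list can be compared with it line by line.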
Revision Description
- Date: 1996-04-18
Tomaž Erjavec, IJS
- Marked-up to CES1 compliance
- Date: 1996-05-02
Tomaž Erjavec, IJS
-
Corrected the header to better correspond to CES recommendations
- Fixed n and id values in DIVs
- Corrected some untagged and mis-tagged NAMEs
- Changed the rend values in accordance with new CES
- Date: 1996-07-17
Tomaž Erjavec, IJS
- New CES1 English version received
changing Slovene accordingly
- made header more similar to Eng one
- Part II, Chp 10 header fixed -
is problematic, and Eng version doesn't have
a chapter here, just an asterisk
- Changed approp. Qs into TITLEs,
moved rend from L to POEM
- Date: 1996-08-08
Tomaž Erjavec, IJS
- Word segmentation of 1984 shows some more
(segmentation) typos, e.g. 'nota,je', '0ceanija'; corrected these.
- Made all part and chapter HEADs of the same form
- As CES now supports nested Qs, de-commented those.
- Date: 1996-10-08
Tomaž Erjavec, IJS
- Some more names were NAME tagged
- "sv." was inconsistently capitalised in the book, and
hence in the corpus; this was uniformly set to "Sv."
- "Sv." tagged as ABBR and left *inside* NAME (!?)
- Corrected "2+2=" into "2+2=5". Sounds bizarre.
- Date: 1996-10-30
Tomaž Erjavec, IJS
- Prepared from IM3
- Date: 1996-12-07
Tomaž Erjavec, IJS
- Two more typos in the book, first chapter corrected:
"videti vse [poslopja] tri hkrati." (vsa);
"da bi ga bilo moči takoj izbrisati." (moč);
"židinja je sedelo" (sedela).
- Date: 1996-12-22
Tomaž Erjavec, IJS
- More typos corrected
- Date: 1997-02-06
Tomaž Erjavec, IJS
- Changed all '...' to hellip entity
- Deleted HI REND="IT" where it contained only other elements
and moved REND="IT" into these elements
- Date: 1997-02-18
Tomaž Erjavec, IJS, Tanja, Renata
- Started work on structure aligning with English
version of 17/01/1997; a number of P and QUOTE elements
added or deleted. Thus we lose the original book information,
but it can be argued that the translation was simply wrong
where it did not reflect the structure of the English original.
- Destroying the sanctity of the translation!
Alignment shows that the translation is missing P containing:
"The old man brightened suddenly."
This has been inserted as P:
"Starec se je nenadoma razveselil."
- Date: 1997-03-20
Tomaž Erjavec, IJS
- Normalisation of corpus component CESHEADER elements:
CESHEADER, EDITIONSTMT, TITLESTMT/H.TITLE
- ISO LANGUAGEs implemented as marked section PUBLIC ent
- Language (WSDs) implemented as PUBLIC entities
- Newspeak LANGUSAGE/LANGUAGE IDs now ns-xx for lang xx
- Now every QUOTE in 1984 has at least one P
- Date: 1997-04-02
Greg Priest-Dorman
- inserted S tags in the locations given by MtSeg
-
inserted Q and HI tags where necessary as a result of
S tag insertion
- updated TAGUSAGE
- changed "sl1984" to "Osl"
- Date: 1997-05-17
Tomaž Erjavec, IJS
- S element harmonisation with English markup
- The "Svinja!"/Swine! was marked as three Q and three
S in Slovene, only one in English
- Date: 1997-05-18
Tomaž Erjavec, IJS
- P ID=Osl.2.11.33 for some reason not S segmented;
corrected by hand.
- In P ID=Osl.3.5.9 the Qs did not terminate
Ss. Inserted two S here
- Manually re-IDed affected Ps
- Date: 1997-05-19
Tomaž Erjavec, IJS
- More missegmentation fixed
- Date: 1997-06-10
Tomaž Erjavec, IJS
- Tag normalisation (no RE, no dbl SP in tags)
- Date: 1997-06-19
Tomaž Erjavec, IJS
- Corrected some more spelling mistakes
- Changed all caps words into lower case,
and marked them as rend=CA
(mtlex does not find them otherwise)
- Changed hellip ent back into '...';
(mt tools cannot deal with hellip)
- Changed mdash ent to '-' in prefixes in appendix:
(pred-, po-, nad-, pod- in Osl.4.8.3)
- Date: 1997-06-23
Tomaž Erjavec, IJS
- Corrected two more typos in 1st Chp
- Date: 1997-07-09
Tomaž Erjavec, IJS
- Final typos corrected;
lexicon now covers all wordforms in text.
- Date: 1997-08-06
Tomaž Erjavec, IJS
- deleted LABEL markup in LIST Osl.1.5.5.1
- updated TAGUSAGE (no LABEL), BYTECOUNT
- Date: 1997-09-08
Tomaž Erjavec, IJS
- Due to a feature of mtlex, lexicon did not cover all
word-forms; a few more typos found and corrected.
- Date: 1997-09-25
Tomaž Erjavec
- Changed editionStmt, byteCount, pubDate, Availability
to final form
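The 1997-06-19 revision records that all-caps words were changed to lower case and marked as rend=CA so that mtlex could find them. A minimal sketch of such a normalisation is given below. It rests on assumptions: the wrapping element (`hi`), the function name `mark_caps`, and the sample string are all hypothetical, and the header does not say how the change was actually carried out. The function expects plain text; run on marked-up text it would also rewrite upper-case tag names.

```python
import re

# A run of two or more letters (Unicode-aware, so Slovene Č, Š, Ž count).
WORD = re.compile(r"\b[^\W\d_]{2,}\b")

def mark_caps(text: str) -> str:
    """Lower-case every all-caps word and wrap it in <hi rend="CA">.

    The element name "hi" is an assumption; the revision note only
    states that the words were marked as rend=CA.
    """
    def replace(match: re.Match) -> str:
        word = match.group(0)
        if word.isupper():
            return f'<hi rend="CA">{word.lower()}</hi>'
        return word  # mixed-case and lower-case words stay unchanged

    return WORD.sub(replace, text)

# Hypothetical sample, not quoted from the corpus.
print(mark_caps("Napis: VELIKI BRAT"))
```

Single capital letters are left alone because a one-letter word at the start of a sentence is indistinguishable from an all-caps word.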